Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Read data into R from external formats (e.g., Excel, .csv, SPSS)
  • Use dplyr functions in R to manipulate datasets
  • Use tidyr functions in R to organize and re-structure datasets

2 Reading in external data

2.1 Functions to read in external data

  • Comma-separated values (CSV) file: Use read.csv() (built-in)
  • Excel: Use read_excel() function from readxl package
  • SAS, SPSS, Stata: Use the appropriate function from the haven package
  • SAS, SPSS, S, Stata, Systat, Epi Info, Minitab: Use the appropriate function from the foreign package
  • JSON: rjson or jsonlite packages
  • There are other packages to read in data, but these are the most commonly used / best maintained ones

2.2 Read in .csv file

gap <- read.csv("gapminder.csv")

Note

  1. You can give the read in data frame a name in R that is shorter than the file name. Here, I named the R data frame gap – much shorter to type.
  2. The data file is in the same folder as the .qmd file here, so I just give the file name. If it were in a different folder, you’d need to supply the full path, such as C:\Users\stefany\Desktop (Windows) or Users/stefany/Desktop (Mac)
  3. Sometimes the \s can cause problems because they’re “escape” characters. If you get an error about that, just switch the \s to /s

2.3 Check the read in .csv file

head(gap)
      country continent year lifeExp      pop gdpPercap
1 Afghanistan      Asia 1952  28.801  8425333  779.4453
2 Afghanistan      Asia 1957  30.332  9240934  820.8530
3 Afghanistan      Asia 1962  31.997 10267083  853.1007
4 Afghanistan      Asia 1967  34.020 11537966  836.1971
5 Afghanistan      Asia 1972  36.088 13079460  739.9811
6 Afghanistan      Asia 1977  38.438 14880372  786.1134

2.4 Read in SPSS file 1

library(haven)
gap_spss1 <- read_sav("gapminder.sav")

Note

  1. read_sav function from the haven package
  2. Again, you can give the read in data frame a name in R that is shorter than or different from the file name. I called this one gap_spss1 so it wouldn’t overwrite the CSV version that was just called gap.
  3. Same comments as above about file location, paths, and slashes

2.5 Check the read in SPSS file 1

head(gap_spss1)
# A tibble: 6 × 6
  country         continent  year lifeExp      pop gdpPercap
  <dbl+lbl>       <dbl+lbl> <dbl>   <dbl>    <dbl>     <dbl>
1 1 [Afghanistan] 3 [Asia]   1952    28.8  8425333      779.
2 1 [Afghanistan] 3 [Asia]   1957    30.3  9240934      821.
3 1 [Afghanistan] 3 [Asia]   1962    32.0 10267083      853.
4 1 [Afghanistan] 3 [Asia]   1967    34.0 11537966      836.
5 1 [Afghanistan] 3 [Asia]   1972    36.1 13079460      740.
6 1 [Afghanistan] 3 [Asia]   1977    38.4 14880372      786.

Note

  1. SPSS has a lot of additional features like labels for data. This is helpful if you have a multi-category variable assigned numbers (i.e., 1, 2, 3, 4) but you want to have information on what the values mean in the dataset. So here, the country and continent variables are type dbl+lbl: dbl is the number, lbl is the actual name label.

2.6 Read in SPSS 2

library(foreign)
gap_spss2 <- read.spss("gapminder.sav", 
                     to.data.frame = TRUE, 
                     use.value.labels = TRUE)

Note

  1. read.spss function from the foreign package
  2. Again, you can give the read in data frame a name in R that is shorter than or different from the file name. I called this one gap_spss2 so it wouldn’t overwrite the CSV version that was just called gap.
  3. Same comments as above about file location, paths, and slashes
  4. to.data.frame = TRUE forces the file into a data frame (in case it wasn’t already)
  5. use.value.labels = TRUE uses the variable labels from SPSS (ignoring the numbers)

2.7 Check the read in SPSS file 2

head(gap_spss2)
      country continent year lifeExp      pop gdpPercap
1 Afghanistan      Asia 1952  28.801  8425333  779.4453
2 Afghanistan      Asia 1957  30.332  9240934  820.8530
3 Afghanistan      Asia 1962  31.997 10267083  853.1007
4 Afghanistan      Asia 1967  34.020 11537966  836.1971
5 Afghanistan      Asia 1972  36.088 13079460  739.9811
6 Afghanistan      Asia 1977  38.438 14880372  786.1134

Note

  1. Note that country and continent are now the labels, not the numbers.

2.8 Reading in data

  • There are a lot of data formats and a lot of methods to read them in
  • If you commonly use a particular data format, make sure you know the ins and outs of the packages for that data format
    • And always look at your data to make sure it was read in correctly
    • head(), glimpse(), str(), etc.

3 “Tidy” your data

3.1 tidyr package

  • tidyr has a lot of features to “tidy” your data
    • Help get data into “tidy” format
      • Rows = observations, columns = variables
    • “Reshape” data from “wide” to “tall” format (and vice versa)
    • Split “combo” variables into multiple variables
    • Complex nested data, missing data

3.2 gapminder dataset

  • This dataset is tall / stacked / univariate
head(gap, n = 15)
       country continent year lifeExp      pop gdpPercap
1  Afghanistan      Asia 1952  28.801  8425333  779.4453
2  Afghanistan      Asia 1957  30.332  9240934  820.8530
3  Afghanistan      Asia 1962  31.997 10267083  853.1007
4  Afghanistan      Asia 1967  34.020 11537966  836.1971
5  Afghanistan      Asia 1972  36.088 13079460  739.9811
6  Afghanistan      Asia 1977  38.438 14880372  786.1134
7  Afghanistan      Asia 1982  39.854 12881816  978.0114
8  Afghanistan      Asia 1987  40.822 13867957  852.3959
9  Afghanistan      Asia 1992  41.674 16317921  649.3414
10 Afghanistan      Asia 1997  41.763 22227415  635.3414
11 Afghanistan      Asia 2002  42.129 25268405  726.7341
12 Afghanistan      Asia 2007  43.828 31889923  974.5803
13     Albania    Europe 1952  55.230  1282697 1601.0561
14     Albania    Europe 1957  59.280  1476505 1942.2842
15     Albania    Europe 1962  64.820  1728137 2312.8890
  • How can we convert this to a wide / multivariate format?

3.3 tidyr functions

pivot_wider(table2, names_from = type, values_from = count)
  • pivot_wider() function
    • table2 is the dataset you start from (in this case, the tall one)
    • names_from is the variable that defines the repeated measure
    • values_from is the variable you are interested in “widening”
      • You can have multiple variables

3.4 gapminder dataset

head(gap, n = 15)
       country continent year lifeExp      pop gdpPercap
1  Afghanistan      Asia 1952  28.801  8425333  779.4453
2  Afghanistan      Asia 1957  30.332  9240934  820.8530
3  Afghanistan      Asia 1962  31.997 10267083  853.1007
4  Afghanistan      Asia 1967  34.020 11537966  836.1971
5  Afghanistan      Asia 1972  36.088 13079460  739.9811
6  Afghanistan      Asia 1977  38.438 14880372  786.1134
7  Afghanistan      Asia 1982  39.854 12881816  978.0114
8  Afghanistan      Asia 1987  40.822 13867957  852.3959
9  Afghanistan      Asia 1992  41.674 16317921  649.3414
10 Afghanistan      Asia 1997  41.763 22227415  635.3414
11 Afghanistan      Asia 2002  42.129 25268405  726.7341
12 Afghanistan      Asia 2007  43.828 31889923  974.5803
13     Albania    Europe 1952  55.230  1282697 1601.0561
14     Albania    Europe 1957  59.280  1476505 1942.2842
15     Albania    Europe 1962  64.820  1728137 2312.8890
  • year defines the repeated measure
  • lifeExp, pop, and gdpPercap into one column per year

3.5 pivot_wider(): Tall to wide

library(tidyr)
gap_wide <- gap %>% pivot_wider(names_from = year, 
                                values_from = c(lifeExp, pop, gdpPercap))
  • Create new data frame called gap_wide
  • Assign to it (<-) the data frame called gap
  • Then “pipe” (%>%) gap into the pivot_wider() function
  • pivot_wider() has 2 arguments (plus the data)
    • names_from = year
    • values_from = c(lifeExp, pop, gdpPercap)

3.6 pivot_wider(): Now wide

head(gap_wide)
# A tibble: 6 × 38
  country     continent lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967
  <chr>       <chr>            <dbl>        <dbl>        <dbl>        <dbl>
1 Afghanistan Asia              28.8         30.3         32.0         34.0
2 Albania     Europe            55.2         59.3         64.8         66.2
3 Algeria     Africa            43.1         45.7         48.3         51.4
4 Angola      Africa            30.0         32.0         34           36.0
5 Argentina   Americas          62.5         64.4         65.1         65.6
6 Australia   Oceania           69.1         70.3         70.9         71.1
# ℹ 32 more variables: lifeExp_1972 <dbl>, lifeExp_1977 <dbl>,
#   lifeExp_1982 <dbl>, lifeExp_1987 <dbl>, lifeExp_1992 <dbl>,
#   lifeExp_1997 <dbl>, lifeExp_2002 <dbl>, lifeExp_2007 <dbl>, pop_1952 <int>,
#   pop_1957 <int>, pop_1962 <int>, pop_1967 <int>, pop_1972 <int>,
#   pop_1977 <int>, pop_1982 <int>, pop_1987 <int>, pop_1992 <int>,
#   pop_1997 <int>, pop_2002 <int>, pop_2007 <int>, gdpPercap_1952 <dbl>,
#   gdpPercap_1957 <dbl>, gdpPercap_1962 <dbl>, gdpPercap_1967 <dbl>, …

3.7 pivot_longer(): Wide to tall

library(tidyr)
gap_tall <- gap_wide %>% pivot_longer(cols = 3:38, 
                                      names_to = "var_year", 
                                      values_to = "value")

Note

  • You (and I) want this to magically turn the wide data back to its original form. It will not. Unfortunately, this relatively simple situation is actually pretty complicated.
  • Instead, we get two columns
    • Column 1: Combination of the variable name and the year variable
    • Column 2: Actual value for that combination
  • A few more steps to get it back (if that’s what you wanted – it might not be)
    • Chance to learn a couple more functions (woo!)

3.8 pivot_longer(): Now tall(ish)

print(gap_tall, n = 30)
# A tibble: 5,112 × 4
   country     continent var_year            value
   <chr>       <chr>     <chr>               <dbl>
 1 Afghanistan Asia      lifeExp_1952         28.8
 2 Afghanistan Asia      lifeExp_1957         30.3
 3 Afghanistan Asia      lifeExp_1962         32.0
 4 Afghanistan Asia      lifeExp_1967         34.0
 5 Afghanistan Asia      lifeExp_1972         36.1
 6 Afghanistan Asia      lifeExp_1977         38.4
 7 Afghanistan Asia      lifeExp_1982         39.9
 8 Afghanistan Asia      lifeExp_1987         40.8
 9 Afghanistan Asia      lifeExp_1992         41.7
10 Afghanistan Asia      lifeExp_1997         41.8
11 Afghanistan Asia      lifeExp_2002         42.1
12 Afghanistan Asia      lifeExp_2007         43.8
13 Afghanistan Asia      pop_1952        8425333  
14 Afghanistan Asia      pop_1957        9240934  
15 Afghanistan Asia      pop_1962       10267083  
16 Afghanistan Asia      pop_1967       11537966  
17 Afghanistan Asia      pop_1972       13079460  
18 Afghanistan Asia      pop_1977       14880372  
19 Afghanistan Asia      pop_1982       12881816  
20 Afghanistan Asia      pop_1987       13867957  
21 Afghanistan Asia      pop_1992       16317921  
22 Afghanistan Asia      pop_1997       22227415  
23 Afghanistan Asia      pop_2002       25268405  
24 Afghanistan Asia      pop_2007       31889923  
25 Afghanistan Asia      gdpPercap_1952      779. 
26 Afghanistan Asia      gdpPercap_1957      821. 
27 Afghanistan Asia      gdpPercap_1962      853. 
28 Afghanistan Asia      gdpPercap_1967      836. 
29 Afghanistan Asia      gdpPercap_1972      740. 
30 Afghanistan Asia      gdpPercap_1977      786. 
# ℹ 5,082 more rows

3.9 Tall to wide to tall

Note

  • A lot of the things you’ll do with pivot_wider() and pivot_longer() are a little bespoke to your particular dataset
    • I used names_from to create the new variables, but there are a lot of other options, depending on how your dataset is set up
    • Check the documentation for tidyr to see all the options, as well as examples
  • I don’t think that I have ever used pivot_wider() or pivot_longer() successfully on the first try

3.10 Split cells

  • The var_year variable has 2 pieces of information:
    • The original variable: lifeExp, pop, gpdPercap
    • The year variable: 1952, 1957, etc.
    • (tidyr example: rates, such as 0.7K/10M, 2K/20M, etc.)
  • Want to split those two pieces of information into 2 variables
    • separate(): multiple columns
    • separate_longer_delim(): multiple rows

3.11 Split cells: separate()

  • Also listed in documentation as separate_wider_delim()
gap_tall2 <- gap_tall %>% separate(var_year, 
                                    sep = "_", 
                                    into = c("variable", "year"))
  • var_year is the variable we want to split
  • sep is the separation character – here, underscore (_)
  • into are the names of the new variables: variable and year

3.12 Split cells: separate()

gap_tall2
# A tibble: 5,112 × 5
   country     continent variable year  value
   <chr>       <chr>     <chr>    <chr> <dbl>
 1 Afghanistan Asia      lifeExp  1952   28.8
 2 Afghanistan Asia      lifeExp  1957   30.3
 3 Afghanistan Asia      lifeExp  1962   32.0
 4 Afghanistan Asia      lifeExp  1967   34.0
 5 Afghanistan Asia      lifeExp  1972   36.1
 6 Afghanistan Asia      lifeExp  1977   38.4
 7 Afghanistan Asia      lifeExp  1982   39.9
 8 Afghanistan Asia      lifeExp  1987   40.8
 9 Afghanistan Asia      lifeExp  1992   41.7
10 Afghanistan Asia      lifeExp  1997   41.8
# ℹ 5,102 more rows

3.13 Combine cells: unite()

  • tidyr example
    • Column 1 is century: 19 or 20
    • Column 2 is year: 00 to 99
    • Combine into a single year variable: 1998, 1999, 2000, 2001
unite(data, century, year, col = "year", sep = "")

3.14 Combine cells: unite()

  • Combine data across sites or cohorts
    • Column 1 is site: 1, 2, 3, 4
    • Column 2 is ID: 001, 002, 003, etc.
    • Combine into a single newID variable: 1001, 1002, 1003, 2001, 2002
unite(data, site, ID, col = "newID", sep = "")

3.15 Missing values

Note

The tidyr functions for missing values might be useful for something. Dropping missing values or replacing them with a specific value is NOT a good choice in experimental data. We don’t have time to go in much depth with this but check the References for more info on handling missing data.

  • I know that it’s common practice in some fields to do single imputation or list-wise deletion – but those are not good methods and introduce bias in your later analysis
    • Sometimes there aren’t a lot of good options, but be aware of these major limitations

3.16 Replacing missing values

  • The missing value code in R is NA
  • If you read in external data, missing values may have other values
    • Some software allows a blank
    • SPSS default: .
    • Common practice in surveys to use 99, -99, 999, or -999 (depending on the scale) for missing and 98, -98, 998, -998 for refused to answer

3.17 Replacing missing values

  • If R recognizes your missing value as a missing value, it will replace it with NA
    • If not, you’ll need to manually replace your missing values with NA
  • Replace all missing value codes (here, -999) with NA
mydata[mydata == -999] <- NA

4 Manipulate your data

4.1 mutate(): Create new variables

  • Create gdp variable from gdpPercap and pop
library(dplyr)
gap1 <- gap %>% mutate(gdp = gdpPercap * pop)
  • Can create multiple new variables in same mutate() statement

4.2 mutate(): Create new variables

head(gap1)
      country continent year lifeExp      pop gdpPercap         gdp
1 Afghanistan      Asia 1952  28.801  8425333  779.4453  6567086330
2 Afghanistan      Asia 1957  30.332  9240934  820.8530  7585448670
3 Afghanistan      Asia 1962  31.997 10267083  853.1007  8758855797
4 Afghanistan      Asia 1967  34.020 11537966  836.1971  9648014150
5 Afghanistan      Asia 1972  36.088 13079460  739.9811  9678553274
6 Afghanistan      Asia 1977  38.438 14880372  786.1134 11697659231

4.3 select(): Select columns (variables)

  • Select only country, continent, year, and lifeExp
library(dplyr)
gap2 <- gap %>% select(c(country, continent, year, lifeExp))
  • Check dplyr documentation
    • Select multiple with c() function
    • A lot of options and helper functions to help you select only certain variables, such as starts_with(), ends_with(), contains()

4.4 select(): Select columns

head(gap2)
      country continent year lifeExp
1 Afghanistan      Asia 1952  28.801
2 Afghanistan      Asia 1957  30.332
3 Afghanistan      Asia 1962  31.997
4 Afghanistan      Asia 1967  34.020
5 Afghanistan      Asia 1972  36.088
6 Afghanistan      Asia 1977  38.438

4.5 filter(): Select rows

  • Retain only countries in Asia
library(dplyr)
gap3 <- gap %>% filter(continent == "Asia")
  • Check dplyr documentation
    • Multiple criteria with &, any criteria with |, not this criteria with !
    • For numeric values, you can use <, >, <=, >=, etc.

4.6 filter(): Select rows

head(gap3, n = 15)
       country continent year lifeExp      pop  gdpPercap
1  Afghanistan      Asia 1952  28.801  8425333   779.4453
2  Afghanistan      Asia 1957  30.332  9240934   820.8530
3  Afghanistan      Asia 1962  31.997 10267083   853.1007
4  Afghanistan      Asia 1967  34.020 11537966   836.1971
5  Afghanistan      Asia 1972  36.088 13079460   739.9811
6  Afghanistan      Asia 1977  38.438 14880372   786.1134
7  Afghanistan      Asia 1982  39.854 12881816   978.0114
8  Afghanistan      Asia 1987  40.822 13867957   852.3959
9  Afghanistan      Asia 1992  41.674 16317921   649.3414
10 Afghanistan      Asia 1997  41.763 22227415   635.3414
11 Afghanistan      Asia 2002  42.129 25268405   726.7341
12 Afghanistan      Asia 2007  43.828 31889923   974.5803
13     Bahrain      Asia 1952  50.939   120447  9867.0848
14     Bahrain      Asia 1957  53.832   138655 11635.7995
15     Bahrain      Asia 1962  56.923   171863 12753.2751

4.7 summarise(): Summarize data

  • Create a new data frame with summary statistics, such as the mean
library(dplyr)
gap4 <- gap %>% summarize(meanlifeExp = mean(lifeExp))

4.8 summarise(): Summarize data

head(gap4)
  meanlifeExp
1    59.47444
  • New data frame with the mean of lifeExp
    • Is this really useful? The mean for the whole dataset?
    • Stay tuned

4.9 arrange(): Change row order

  • Sort rows from smallest to largest value (ascending, default)
library(dplyr)
gap5 <- gap %>% arrange(lifeExp)
  • Sort in descending order by wrapping variable name(s) in desc()

4.10 arrange(): Change row order

head(gap5)
       country continent year lifeExp     pop gdpPercap
1       Rwanda    Africa 1992  23.599 7290203  737.0686
2  Afghanistan      Asia 1952  28.801 8425333  779.4453
3       Gambia    Africa 1952  30.000  284320  485.2307
4       Angola    Africa 1952  30.015 4232095 3520.6103
5 Sierra Leone    Africa 1952  30.331 2143249  879.7877
6  Afghanistan      Asia 1957  30.332 9240934  820.8530

4.11 group_by(): Separately by group

  • Do any of the above operations separately for each group
  • Get mean life expectancy for each continent
library(dplyr)
gap6 <- gap %>% group_by(continent) %>% 
                summarize(meanlifeExp = mean(lifeExp)) %>% 
                ungroup()

Note

  • VERY IMPORTANT
    • Make sure to ungroup() after you do whatever you need the group_by() for
    • If you don’t, anything you do will still be done by group, including analyses

4.12 group_by(): Separately by group

head(gap6)
# A tibble: 5 × 2
  continent meanlifeExp
  <chr>           <dbl>
1 Africa           48.9
2 Americas         64.7
3 Asia             60.1
4 Europe           71.9
5 Oceania          74.3

5 In-class activities

5.1 In-class activities

  • Read data into R from external formats (e.g., Excel, .csv, SPSS)
  • Use dplyr functions in R to manipulate datasets
  • Use tidyr functions in R to organize and re-structure datasets