Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Understand the different types of variables that can be measured
  • Think about variables in terms of central tendency or location
  • Think about variables in terms of dispersion or spread
  • Start to think about variable distributions and plots

2 Types of variables

2.1 Types of variables

  • There are a lot of ways to talk about variables
    • Continuous vs categorical
    • Quantitative vs qualitative
    • Numeric vs non-numeric
  • Often loose, ambiguous, and arbitrary

2.2 Levels of measurement

  • Four ordered levels of measurement based on the mathematical operations that can be performed
    • Nominal
    • Ordinal
    • Interval
    • Ratio

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677-680.

2.3 Nominal variables

  • Categories with no intrinsic ordering
    • Nominal = “name”
  • Examples
    • Department: Psychology, Epidemiology, Statistics, Business
    • Religion: Christian, Jewish, Muslim, Atheist
    • Ice cream flavor: vanilla, chocolate, strawberry

2.4 Ordinal variables

  • Categories with some intrinsic ordering
    • Ordinal = “ordered”
    • Differences between categories are not meaningful/equal
  • Examples
    • Dose of treatment: low, medium, high
    • Rankings: 1st, 2nd, 3rd, 4th
    • Education: high school, some college, college grad, graduate
    • Likert scales: agree, neutral, disagree

2.5 Interval variables

  • Quantitative variables with no meaningful 0 point
    • (“Meaningful 0”: value of 0 = nothing)
    • Differences between values are meaningful but ratios are not!
  • Example: Temperature in Fahrenheit or Celsius
    • Difference from 100F to 90F = difference from 90F to 80F
    • But 100F is not twice 50F (because 0F is arbitrary)
  • Most “continuous” variables you deal with are interval
    • Most statistical procedures assume interval-level measurement

2.6 Ratio variables

  • Quantitative variables with meaningful 0 point
    • (“Meaningful 0”: value of 0 = nothing)
    • Differences between values are meaningful and so are ratios!
  • Example: Temperature in Kelvin
    • Difference from 100K to 90K = difference from 90K to 80K
    • 100K is twice as hot as 50K (0K is zero molecular movement)
  • Height, weight, age, counts
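
The interval/ratio distinction can be checked with a little arithmetic. The sketch below assumes a hypothetical helper, f_to_k(), that converts Fahrenheit to Kelvin; it is only an illustration of why ratios of Fahrenheit values are not meaningful.

```r
# Hypothetical helper: convert Fahrenheit (interval) to Kelvin (ratio)
f_to_k <- function(f) (f - 32) * 5 / 9 + 273.15

# Differences are meaningful on an interval scale: both gaps are 10 degrees F
diff_high <- 100 - 90
diff_low  <- 90 - 80

# Ratios are not: 100 / 50 = 2 in Fahrenheit, but the same two
# temperatures expressed in Kelvin have a ratio of only about 1.1
ratio_f <- 100 / 50
ratio_k <- f_to_k(100) / f_to_k(50)

print(c(diff_high = diff_high, diff_low = diff_low))
print(c(ratio_f = ratio_f, ratio_k = ratio_k))
```

Because 0F does not mean "no temperature", the ratio changes when the unit changes; on the Kelvin scale it does not.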

2.7 Stevens (1946)

The levels of measurement determine what mathematical (and statistical) operations you can perform

Mathematical operation   Nominal         Ordinal         Interval        Ratio
equal, not equal         \(\checkmark\)  \(\checkmark\)  \(\checkmark\)  \(\checkmark\)
greater or less than                     \(\checkmark\)  \(\checkmark\)  \(\checkmark\)
add, subtract                                            \(\checkmark\)  \(\checkmark\)
multiply, divide                                                         \(\checkmark\)
central tendency         mode            median          mean            mean

2.8 R “translations” of variable types

  • Nominal
    • chr = “character”: Text variable, also called “string”
  • Nominal or ordinal
    • fct = “factor”: Ordered or unordered categories
    • lgl = “logical” or “boolean”: TRUE or FALSE
  • Interval or ratio
    • int = “integer”: Whole number (no decimals)
    • dbl = “double” (num): Number with possible decimal places
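
As a rough illustration of these types, the sketch below builds one small vector of each kind in base R. All variable names and values are made up for the example.

```r
# Nominal: character vector and unordered factor
dept   <- c("Psychology", "Epidemiology", "Statistics")
flavor <- factor(c("vanilla", "chocolate", "strawberry"))

# Ordinal: ordered factor with explicitly ordered levels
dose <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)

# Logical: TRUE or FALSE
is_treated <- c(TRUE, FALSE, TRUE)

# Interval or ratio: integer (note the L suffix) and double
n_visits  <- c(2L, 5L, 3L)
weight_kg <- c(61.5, 70.2, 82.0)

print(class(dept))        # "character"
print(class(flavor))      # "factor"
print(class(dose))        # "ordered" "factor"
print(class(is_treated))  # "logical"
print(typeof(n_visits))   # "integer"
print(typeof(weight_kg))  # "double"

# Greater/less-than comparisons are defined for ordered factors
print(dose[1] < dose[2])  # low < high, so TRUE
```

Setting the levels explicitly matters: without levels = ..., R would order them alphabetically (high, low, medium).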

3 Central tendency

3.1 Central tendency

  • Also called “location”
    • Where are the observations located?
  • Several measures of central tendency
    • Depend on scale type
    • Mode, median, mean

3.2 Mode

  • The most frequently occurring value
    • Value actually exists in the data
  • Used for nominal, ordinal
    • Can also be used for interval and ratio variables, but is less useful there
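
A minimal base R sketch of finding the mode by counting values, using a made-up vector of ice cream flavors. It is not the DescTools implementation, just one simple way to get the same idea.

```r
# Made-up nominal data
flavors <- c("vanilla", "chocolate", "vanilla",
             "strawberry", "vanilla", "chocolate")

# Count how often each value occurs
counts <- table(flavors)

# Keep every value whose count equals the largest count
# (this returns more than one value if there is a tie)
modes <- names(counts)[counts == max(counts)]

print(counts)
print(modes)  # "vanilla"
```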

3.3 Median

  • Order observations from smallest to largest
    • Select the middle observation
    • Value actually exists in the data if \(n\) is odd; if \(n\) is even, the median is the average of the two middle observations
  • Used for ordinal, interval, and ratio variables
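
A short sketch with made-up vectors showing how median() behaves for an odd and an even number of observations:

```r
x_odd  <- c(3, 1, 7, 5, 9)  # n = 5
x_even <- c(3, 1, 7, 5)     # n = 4

print(sort(x_odd))     # 1 3 5 7 9, middle value is 5
print(median(x_odd))   # 5, an observed value

print(sort(x_even))    # 1 3 5 7, middle values are 3 and 5
print(median(x_even))  # 4, the average of 3 and 5

# With even n, the median need not be one of the observations
print(median(x_even) %in% x_even)  # FALSE
```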

3.4 Mean

  • Add up all observations and divide by the number of observations
    • Value may not actually occur in the data
  • Mathematically: \(\frac{\sum_{i=1}^n X_i}{n}\)
  • Used for interval and ratio variables
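
The formula can be checked by hand against R's mean() on a small made-up vector:

```r
x <- c(2, 4, 4, 5, 9)

# Sum of all observations divided by the number of observations
manual_mean <- sum(x) / length(x)

print(manual_mean)  # 24 / 5 = 4.8
print(mean(x))      # same value

# The mean does not have to be one of the observed values
print(manual_mean %in% x)  # FALSE
```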

3.5 gapminder dataset

library(gapminder)  # provides the gapminder data frame
gap <- gapminder
head(gap)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

3.6 gapminder variables

Variable    R variable type     Scale type
country     factor              nominal
continent   factor              nominal
year        integer             ordinal / interval
lifeExp     double / numeric    ratio
pop         integer             ratio (count)
gdpPercap   double / numeric    ratio

3.7 Nominal variables: Mode

  • Mode() function from DescTools package
Mode(gap$country)
  [1] Afghanistan              Albania                  Algeria                 
  [4] Angola                   Argentina                Australia               
  [7] Austria                  Bahrain                  Bangladesh              
 [10] Belgium                  Benin                    Bolivia                 
 [13] Bosnia and Herzegovina   Botswana                 Brazil                  
 [16] Bulgaria                 Burkina Faso             Burundi                 
 [19] Cambodia                 Cameroon                 Canada                  
 [22] Central African Republic Chad                     Chile                   
 [25] China                    Colombia                 Comoros                 
 [28] Congo, Dem. Rep.         Congo, Rep.              Costa Rica              
 [31] Cote d'Ivoire            Croatia                  Cuba                    
 [34] Czech Republic           Denmark                  Djibouti                
 [37] Dominican Republic       Ecuador                  Egypt                   
 [40] El Salvador              Equatorial Guinea        Eritrea                 
 [43] Ethiopia                 Finland                  France                  
 [46] Gabon                    Gambia                   Germany                 
 [49] Ghana                    Greece                   Guatemala               
 [52] Guinea                   Guinea-Bissau            Haiti                   
 [55] Honduras                 Hong Kong, China         Hungary                 
 [58] Iceland                  India                    Indonesia               
 [61] Iran                     Iraq                     Ireland                 
 [64] Israel                   Italy                    Jamaica                 
 [67] Japan                    Jordan                   Kenya                   
 [70] Korea, Dem. Rep.         Korea, Rep.              Kuwait                  
 [73] Lebanon                  Lesotho                  Liberia                 
 [76] Libya                    Madagascar               Malawi                  
 [79] Malaysia                 Mali                     Mauritania              
 [82] Mauritius                Mexico                   Mongolia                
 [85] Montenegro               Morocco                  Mozambique              
 [88] Myanmar                  Namibia                  Nepal                   
 [91] Netherlands              New Zealand              Nicaragua               
 [94] Niger                    Nigeria                  Norway                  
 [97] Oman                     Pakistan                 Panama                  
[100] Paraguay                 Peru                     Philippines             
[103] Poland                   Portugal                 Puerto Rico             
[106] Reunion                  Romania                  Rwanda                  
[109] Sao Tome and Principe    Saudi Arabia             Senegal                 
[112] Serbia                   Sierra Leone             Singapore               
[115] Slovak Republic          Slovenia                 Somalia                 
[118] South Africa             Spain                    Sri Lanka               
[121] Sudan                    Swaziland                Sweden                  
[124] Switzerland              Syria                    Taiwan                  
[127] Tanzania                 Thailand                 Togo                    
[130] Trinidad and Tobago      Tunisia                  Turkey                  
[133] Uganda                   United Kingdom           United States           
[136] Uruguay                  Venezuela                Vietnam                 
[139] West Bank and Gaza       Yemen, Rep.              Zambia                  
[142] Zimbabwe                
attr(,"freq")
[1] 12
142 Levels: Afghanistan Albania Algeria Angola Argentina Australia ... Zimbabwe
  • There is no single mode for country: all 142 countries appear 12 times each, so they all tie

3.8 Nominal variables: Mode

  • Mode() function from DescTools package
Mode(gap$continent)
[1] Africa
attr(,"freq")
[1] 624
Levels: Africa Americas Asia Europe Oceania
  • There are 52 countries in Africa, so Africa appears in 52 countries * 12 years = 624 rows
    • The value of “Africa” is the mode of the continent variable

3.9 Ordinal variables: Median

median(gap$year, na.rm = TRUE)
[1] 1979.5
  • 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007
    • Median is halfway between 1977 and 1982
  • na.rm = TRUE option removes missing values
    • If you have missing values and don’t include this, you will get NA
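
The effect of na.rm can be seen on a small made-up vector containing one missing value:

```r
x <- c(4, 8, NA, 6)

# Without na.rm = TRUE, a single NA makes the result NA
print(mean(x))    # NA
print(median(x))  # NA

# With na.rm = TRUE, the NA is dropped before computing
print(mean(x, na.rm = TRUE))    # (4 + 8 + 6) / 3 = 6
print(median(x, na.rm = TRUE))  # middle of 4, 6, 8 = 6

# Count how many values are missing
print(sum(is.na(x)))  # 1
```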

3.10 Interval variables: Mean

mean(gap$year, na.rm = TRUE)
[1] 1979.5
  • \((1952 + 1957 + \dots + 2002 + 2007)/12\)
  • You can also calculate the median of an interval variable, but we just did that for this variable (treating it as ordinal)
  • na.rm = TRUE option removes missing values
    • If you have missing values and don’t include this, you will get NA

3.11 Use only data from 2007

  • Including all data gives us a confusing mish-mash of information
    • I’m just going to include data from 2007 for these variables
  • Also going to create the gdp variable at the same time
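library(dplyr)  # provides %>%, mutate(), and filter()
gap_2007 <- gapminder %>% 
  mutate(gdp = gdpPercap * pop) %>%  # total GDP = GDP per capita * population
  filter(year == 2007)               # keep only the 2007 rows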

3.12 Ratio variables: Median

median(gap_2007$lifeExp, na.rm = TRUE)
[1] 71.9355
median(gap_2007$pop, na.rm = TRUE)
[1] 10517531
median(gap_2007$gdpPercap, na.rm = TRUE)
[1] 6124.371
median(gap_2007$gdp, na.rm = TRUE)
[1] 57869055458
  • 50% of countries have a life expectancy less than 71.9355 years
  • 50% of countries have a life expectancy greater than 71.9355 years

3.13 Ratio variables: Mean

mean(gap_2007$lifeExp, na.rm = TRUE)
[1] 67.00742
mean(gap_2007$pop, na.rm = TRUE)
[1] 44021220
mean(gap_2007$gdpPercap, na.rm = TRUE)
[1] 11680.07
mean(gap_2007$gdp, na.rm = TRUE)
[1] 409220666999
  • The average life expectancy is 67.00742 years

3.14 Compare mean and median

  • Mean and median are both measures of central tendency for numbers
    • What does it mean when they differ?
  • Median depends only on the middle value(s) of the ordered data
    • Extreme values don’t impact it
  • Mean involves all values
    • Extreme values impact the mean
    • Mean is “pulled toward” extreme values
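
A small made-up example shows the pull of a single extreme value on the mean while the median stays put:

```r
x         <- c(1, 2, 3, 4, 5)
x_extreme <- c(1, 2, 3, 4, 100)  # same data, largest value replaced

# Symmetric data: mean and median agree
print(mean(x))    # 3
print(median(x))  # 3

# One extreme value: the mean moves, the median does not
print(mean(x_extreme))    # 110 / 5 = 22
print(median(x_extreme))  # 3
```

When the mean is well above the median, as here, that suggests a long right tail.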

3.15 Mean and median of lifeExp

ggplot(data = gap_2007, aes(x = lifeExp)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(gap_2007$lifeExp), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(gap_2007$lifeExp), color = "red", linewidth = 1.5, linetype = "dashed")

3.16 Mean and median of pop

ggplot(data = gap_2007, aes(x = pop)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(gap_2007$pop), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(gap_2007$pop), color = "red", linewidth = 1.5, linetype = "dashed")

3.17 Mean and median of gdpPercap

ggplot(data = gap_2007, aes(x = gdpPercap)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(gap_2007$gdpPercap), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(gap_2007$gdpPercap), color = "red", linewidth = 1.5, linetype = "dashed")

4 Dispersion

4.1 Dispersion

  • Also called “spread”
    • How spread out or dispersed are the observations?
    • Only applies to numeric (interval or ratio) variables
  • Several measures of dispersion
    • Depend on scale type and variable distribution
    • Minimum and maximum, percentiles, standard deviation

4.2 Minimum and maximum

  • Minimum
min(gap_2007$lifeExp, na.rm = TRUE)
[1] 39.613
min(gap_2007$pop, na.rm = TRUE)
[1] 199579
min(gap_2007$gdpPercap, na.rm = TRUE)
[1] 277.5519
  • Maximum
max(gap_2007$lifeExp, na.rm = TRUE)
[1] 82.603
max(gap_2007$pop, na.rm = TRUE)
[1] 1318683096
max(gap_2007$gdpPercap, na.rm = TRUE)
[1] 49357.19
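
As a side note, base R's range() returns the minimum and maximum together. A sketch on a made-up vector:

```r
x <- c(12, 7, 3, 18, 9)

# range() returns c(min, max) in one call
x_range <- range(x, na.rm = TRUE)
print(x_range)  # 3 18

# The width of the range is the difference between the two
x_width <- diff(x_range)
print(x_width)  # 15
```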

4.3 Percentiles

  • Value below which some percentage of observations lie
    • 67th percentile: 67% of observations have values below this value
    • Median is the 50th percentile: 50% below (and 50% above)
    • Quartiles divide into 4 parts: 25th, 50th, 75th percentiles
    • Standardized tests: 97th percentile means you scored higher than 97% of people taking the test
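
To make the definition concrete, the sketch below uses the made-up values 0 through 100, where the quartiles are easy to verify by eye. It relies on quantile()'s default interpolation method.

```r
x <- 0:100  # 101 evenly spaced values

# 25th, 50th, and 75th percentiles (the quartiles)
quartiles <- quantile(x, c(0.25, 0.5, 0.75))
print(quartiles)  # 25, 50, 75

# The 50th percentile is the median
print(median(x))  # 50

# Roughly 75% of observations lie below the 75th percentile
prop_below <- mean(x < quartiles["75%"])
print(prop_below)  # 75 / 101, about 0.74
```

The proportion is only approximately 75% because the data are a finite set of discrete values.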

4.4 Percentiles

quantile(gap_2007$lifeExp, c(0.25, 0.5, 0.75))
     25%      50%      75% 
57.16025 71.93550 76.41325 
quantile(gap_2007$pop, c(0.25, 0.5, 0.75))
     25%      50%      75% 
 4508034 10517531 31210042 
quantile(gap_2007$gdpPercap, c(0.1, 0.9))
       10%        90% 
  887.2871 33644.0530 

4.5 Standard deviation (& variance)

  • Spread around the mean
  • Same units as the variable
    • Unlike variance, which is \(SD^2\)
  • Influenced by extreme values
  • Mathematically: \(\sqrt{\frac{\sum (X_i - \overline{X})^2}{n-1}}\)
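
The formula can be checked step by step against R's sd() and var() on a small made-up vector:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Squared deviations from the mean, summed
ss <- sum((x - mean(x))^2)  # mean is 5, so ss = 32

# Divide by n - 1 for the variance, then take the square root
manual_var <- ss / (n - 1)
manual_sd  <- sqrt(manual_var)

print(manual_sd)
print(sd(x))      # matches manual_sd

# Variance is the standard deviation squared
print(manual_var)
print(var(x))     # matches manual_var
```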

4.6 Presenting location & dispersion

  • Symmetric distribution
    • Mean \(\pm\) standard deviation
    • Median (25th to 75th percentile)
    • Median (min, max)
  • Asymmetric distribution
    • Median (25th to 75th percentile)
    • Median (min, max)
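
One possible way to produce these summaries as text is sketched below. The helper names summarize_mean_sd() and summarize_median_iqr() are hypothetical, and the data are made up.

```r
# Hypothetical helper: "mean +/- SD", rounded to one decimal place
summarize_mean_sd <- function(x) {
  sprintf("%.1f +/- %.1f", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE))
}

# Hypothetical helper: "median (25th to 75th percentile)"
summarize_median_iqr <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE))
  sprintf("%.1f (%.1f to %.1f)", q[2], q[1], q[3])
}

x <- 1:9  # symmetric made-up data

print(summarize_mean_sd(x))     # "5.0 +/- 2.7"
print(summarize_median_iqr(x))  # "5.0 (3.0 to 7.0)"
```

For a symmetric variable like this one, both summaries tell the same story; for a skewed variable, the median-based summary is the safer choice.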

4.7 Presenting

lifeexp_perc <- quantile(gap_2007$lifeExp, c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE)
lifeexp_sd <- sd(gap_2007$lifeExp, na.rm = TRUE)
lifeexp_mean <- mean(gap_2007$lifeExp, na.rm = TRUE)
ggplot(data = gap_2007, aes(x = lifeExp)) + 
  geom_histogram(bins = 30, color = "grey80")

4.8 Presenting: Mean plus/minus 1 SD

ggplot(data = gap_2007, aes(x = lifeExp)) + 
  geom_histogram(bins = 30, color = "grey80") +
  annotate("pointrange", x = lifeexp_mean, y = 5, 
            xmin = lifeexp_mean - lifeexp_sd,
            xmax = lifeexp_mean + lifeexp_sd,
            size = 1.5, linewidth = 1.5, color = "red") +
  ylab("count")

4.9 Presenting: Median, 25th, 75th %iles

ggplot(data = gap_2007, aes(x = lifeExp)) + 
  geom_histogram(bins = 30, color = "grey80") +
  annotate("pointrange", x = lifeexp_perc[3], y = 5, 
            xmin = lifeexp_perc[2],
            xmax = lifeexp_perc[4],
            size = 1.5, linewidth = 1.5, color = "red") +
  ylab("count")

5 In-class activities

5.1 In-class activities

  • Look at some data and variables
    • Think about what type(s) of variables they are
  • Examine central tendency and dispersion for the variables
    • Present the best values to describe the data
  • Start to think about variable distributions and plots