Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Select an appropriate plot for the variable type
  • Create plots in ggplot2 (part of the tidyverse)

2 Plots

2.1 Levels of measurement

  • Four ordered levels of measurement based on the mathematical operations that can be performed
    • Nominal
    • Ordinal
    • Interval
    • Ratio

Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677-680.

2.2 Nominal variables

  • Categories with no intrinsic ordering
    • Nominal = “name”
  • Examples
    • Department: Psychology, Epidemiology, Statistics, Business
    • Religion: Christian, Jewish, Muslim, Atheist
    • Ice cream flavor: vanilla, chocolate, strawberry

2.3 Ordinal variables

  • Categories with an intrinsic ordering
    • Ordinal = “ordered”
    • Differences between adjacent categories are not necessarily equal or meaningful
  • Examples
    • Dose of treatment: low, medium, high
    • Rankings: 1st, 2nd, 3rd, 4th
    • Education: high school, some college, college grad, graduate
    • Likert scales: agree, neutral, disagree

2.4 Interval variables

  • Quantitative variables with no meaningful 0 point
    • (“Meaningful 0”: value of 0 = nothing)
    • Differences between values are meaningful but ratios are not!
  • Example: Temperature in Fahrenheit or Celsius
    • Difference from 100F to 90F = difference from 90F to 80F
    • But 100F is not twice 50F (because 0F is arbitrary)
  • Most “continuous” variables you deal with are interval
    • Most statistical procedures assume interval-level measurement

2.5 Ratio variables

  • Quantitative variables with meaningful 0 point
    • (“Meaningful 0”: value of 0 = nothing)
    • Differences between values are meaningful and so are ratios!
  • Example: Temperature in Kelvin
    • Difference from 100K to 90K = difference from 90K to 80K
    • 100K is twice as hot as 50K (0K is zero molecular movement)
  • Other examples: height, weight, age, counts

2.6 Stevens (1946)

The levels of measurement determine what mathematical (and statistical) operations you can perform

Mathematical operation   Nominal         Ordinal         Interval        Ratio
equal, not equal         \(\checkmark\)  \(\checkmark\)  \(\checkmark\)  \(\checkmark\)
greater or less than     -               \(\checkmark\)  \(\checkmark\)  \(\checkmark\)
add, subtract            -               -               \(\checkmark\)  \(\checkmark\)
multiply, divide         -               -               -               \(\checkmark\)
central tendency         mode            median          mean            mean
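
As a rough illustration of the last row of the table, the sketch below computes the permitted measure of central tendency for one variable at each level. The data values are invented for this example, and only base R is used.

```r
# Hypothetical toy data, one vector per level of measurement
flavor <- c("vanilla", "chocolate", "vanilla", "strawberry")   # nominal
dose   <- factor(c("low", "high", "medium", "low", "medium"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)                                # ordinal
temp_f <- c(68, 72, 75, 80)                                     # interval

# Nominal: only counting is meaningful, so report the mode
flavor_mode <- names(which.max(table(flavor)))

# Ordinal: order is meaningful, so the middle category is defined
# (the integer codes follow the level order given above)
dose_median <- levels(dose)[median(as.integer(dose))]

# Interval (and ratio): differences are meaningful, so the mean is defined
temp_mean <- mean(temp_f)

print(flavor_mode)
print(dose_median)
print(temp_mean)
```

The ordinal median is looked up through the integer codes, which works cleanly here because the number of observations is odd.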

2.7 Variable types

R variable type    Stevens (1946) variable type   ggplot2 *
chr or character   nominal                        discrete
fct or factor      nominal or ordinal             discrete
lgl or logical     nominal or ordinal             discrete
int or integer     interval or ratio              continuous
dbl or double      interval or ratio              continuous

* From the ggplot2 cheatsheet
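
The mapping in the table can be sketched in base R. The helper `guess_scale_type()` below is hypothetical (it is not part of ggplot2) and simply restates the table's last column; the example vectors are made up.

```r
# Hypothetical helper (not a ggplot2 function): returns the ggplot2
# scale type that the table above lists for each R variable type
guess_scale_type <- function(x) {
  if (is.character(x) || is.factor(x) || is.logical(x)) {
    "discrete"
  } else if (is.numeric(x)) {   # TRUE for both integer and double
    "continuous"
  } else {
    "not covered by the table"
  }
}

# One made-up vector per row of the table
examples <- list(
  chr = c("vanilla", "chocolate"),
  fct = factor(c("low", "high")),
  lgl = c(TRUE, FALSE),
  int = c(1L, 2L),
  dbl = c(1.5, 2.5)
)

scale_types <- sapply(examples, guess_scale_type)
print(scale_types)
```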

2.8 Why do we make plots?

  • Exploration

  • Analysis

  • Presentation

  • Sometimes, the best way to convey really complicated information

2.9 Baby data - 5.5 mo vs 10.5 mo

2.10 How we’d like it to work

https://xkcd.com/2400/

2.11 How it really works

https://www.instagram.com/twisteddoodles/

3 Data

3.1 Data

  • ICU data from the Stat2Data package
    • ID: Patient ID code
    • Survive: 1 = patient survived to discharge or 0 = patient died
    • Age: Age (in years)
    • AgeGroup: 1 = young (under 50), 2 = middle (50-69), 3 = old (70+)
    • Sex: 1 = female or 0 = male
    • Infection: 1 = infection suspected or 0 = no infection
    • SysBP: Systolic blood pressure (in mm of Hg)
    • Pulse: Heart rate (beats per minute)
    • Emergency: 1 = emergency admission or 0 = elective admission

3.2 Data

  • Nominal / ordinal / factor / discrete / binary
    • Survive: 1 = patient survived to discharge or 0 = patient died
    • Sex: 1 = female or 0 = male
    • Infection: 1 = infection suspected or 0 = no infection
    • Emergency: 1 = emergency admission or 0 = elective admission
  • Ordinal / factor / discrete
    • AgeGroup: 1 = young (under 50), 2 = middle (50-69), 3 = old (70+)

3.3 Data

  • Ratio / integer / numeric / continuous
    • Age: Age (in years)
    • SysBP: Systolic blood pressure (in mm of Hg)
    • Pulse: Heart rate (beats per minute)

3.4 Data

library(Stat2Data)
data(ICU)
head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
1  4       0  87        3   1         1    80    96         1
2  8       1  27        1   1         1   142    88         1
3 12       1  59        2   0         0   112    80         1
4 14       1  77        3   0         0   100    70         0
5 27       0  76        3   1         1   128    90         1
6 28       1  54        2   0         1   142   103         1

3.5 Change variable types

ICU$Survive <- as.factor(ICU$Survive)
ICU$Sex <- as.factor(ICU$Sex)
ICU$Infection <- as.factor(ICU$Infection)
ICU$Emergency <- as.factor(ICU$Emergency)
ICU$AgeGroup <- as.factor(ICU$AgeGroup)
  • By default, factor levels are ordered numerically or alphabetically
  • Nothing we’re doing today requires changing that order, but it can matter for more complex figures or when you need a specific category order
    • You can also change the variable type within ggplot(), e.g., aes(x = as.factor(Sex))
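
A minimal sketch of the default level order and one way to override it. The AgeGroup codes are the six values printed by head(ICU); the character vector and the labels (taken from the AgeGroup coding) are used here only for illustration.

```r
# AgeGroup codes from the six rows printed by head(ICU)
age_group <- c(3, 1, 2, 3, 3, 2)

# Default: levels are sorted in numerical order
f_default <- factor(age_group)
print(levels(f_default))

# Character data: levels are sorted alphabetically,
# which here is not the natural age order
f_alpha <- factor(c("young", "old", "middle"))
print(levels(f_alpha))

# Explicit levels and labels give a meaningful, controlled order
f_labeled <- factor(age_group,
                    levels = c(1, 2, 3),
                    labels = c("young", "middle", "old"))
print(levels(f_labeled))
```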

4 ggplot2 package

4.1 ggplot2 package

  • Wilkinson, L. (2005). The Grammar of Graphics (2nd ed.). Statistics and Computing, New York: Springer.
  • Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3-28.
  • Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer.
  • Online documentation

4.2 Grammar of graphics

  • Grammar gives language rules. – Leland Wilkinson
  • The grammar tells us what words make up our graphical “sentences,” but offers no advice on how to write well. – Hadley Wickham
    • “Colorless green ideas sleep furiously”

4.3 Grammar of graphics

  • Data
  • Variables
  • Geometry
  • Aesthetics
  • Algebra
  • Scales
  • Statistics
  • Coordinates
  • Facets

4.4 ggplot() structure

ggplot(data = DATA,
       aes(x = XVAR, y = YVAR)) +
       geom_MYGEOM() +
       geom_ANOTHERGEOM() +
       SOME_OTHER_THING +
       ANOTHER_THING
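
One possible way to fill in the template, shown as a sketch rather than the only form. To keep it self-contained, `icu6` holds just the six rows printed by head(ICU); the axis labels via labs() and the choice of geoms are assumptions made for illustration.

```r
library(ggplot2)

# First six rows of ICU as printed by head(ICU);
# the full data set has many more rows
icu6 <- data.frame(
  SysBP = c(80, 142, 112, 100, 128, 142),
  Pulse = c(96, 88, 80, 70, 90, 103)
)

p <- ggplot(data = icu6,                       # DATA
            aes(x = SysBP, y = Pulse)) +       # XVAR, YVAR
  geom_point() +                               # geom_MYGEOM()
  geom_smooth(method = "lm", se = FALSE) +     # geom_ANOTHERGEOM()
  labs(x = "Systolic blood pressure (mm Hg)",  # SOME_OTHER_THING
       y = "Pulse (beats per minute)") +
  theme_gray()                                 # ANOTHER_THING

# Printing the object draws the plot
print(p)
```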

5 Plots for a single variable

5.1 Different kinds of plots

https://www.instagram.com/twisteddoodles/

5.2 One discrete: geom_bar()

# Load ggplot2 once per session before the first ggplot() call
library(ggplot2)

ggplot(data = ICU, 
       aes(x = Survive)) + 
       geom_bar()

5.3 One discrete: geom_bar()

ggplot(data = ICU, 
       aes(x = AgeGroup)) + 
       geom_bar()

5.4 One continuous: geom_bar()

# geom_bar() counts each distinct Age value separately: one thin bar per age
ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_bar()

5.5 One continuous: geom_histogram()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 30)

5.6 One continuous: geom_histogram()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 60)

5.7 One continuous: geom_histogram()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(binwidth = 2)

5.8 One continuous: geom_dotplot()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_dotplot(binwidth = 1)

5.9 One continuous: geom_dotplot()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_dotplot(binwidth = 2)

5.10 One continuous: geom_dotplot()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_dotplot(binwidth = 2, 
                    method = "histodot")

5.11 One continuous: geom_dotplot()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_dotplot(binwidth = 2, 
                    method = "histodot", 
                    stackdir = "center")

5.12 One continuous: geom_density()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_density()

5.13 One continuous: geom_violin()

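# A violin needs both x and y; y = 0 is a constant placeholder
# so that a single violin is drawn along the Age axis
ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_violin(aes(y = 0))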

5.14 One continuous: violin + dotplot

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_violin(aes(y = 0)) + 
       geom_dotplot(method = "histodot", 
                    stackdir = "center", 
                    binwidth = 2)

Note

Geoms are drawn in the order they are listed. By default, geom_violin() has an opaque white fill, so if geom_dotplot() were listed before geom_violin(), the violin would cover almost all of the dots.

5.15 One continuous: stat_qq

ggplot(data = ICU,
       aes(sample = Age)) + 
       stat_qq() +
       stat_qq_line()
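
A base-R sketch of what the two layers compute, assuming the default normal reference distribution and the default reference line through the first and third quartiles. Only the six Age values printed by head(ICU) are used, so the numbers are illustrative.

```r
# Age values from the six rows printed by head(ICU)
age <- c(87, 27, 59, 77, 76, 54)
n <- length(age)

# Points: sorted sample values against normal quantiles
# at evenly spread probabilities
observed    <- sort(age)
theoretical <- qnorm(ppoints(n))
qq <- data.frame(theoretical, observed)

# Reference line: through the first and third quartiles
# of the sample and of the normal distribution
probs     <- c(0.25, 0.75)
y_q       <- unname(quantile(observed, probs))
x_q       <- qnorm(probs)
slope     <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

print(qq)
print(c(intercept = intercept, slope = slope))
```

Points that fall close to the reference line suggest the sample is roughly normal; systematic curvature suggests skew or heavy tails.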

6 Plots for two variables

6.1 Two discrete: geom_count()

ggplot(data = ICU, 
       aes(x = Sex, y = Survive)) + 
       geom_count()
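
The point sizes in geom_count() reflect how many observations share each combination of x and y. The same counts can be sketched in base R with table(); the values below are only the six rows printed by head(ICU).

```r
# Sex and Survive from the six rows printed by head(ICU)
sex     <- c(1, 1, 0, 0, 1, 0)   # 1 = female, 0 = male
survive <- c(0, 1, 1, 1, 0, 1)   # 1 = survived, 0 = died

# Cross-tabulate: one cell per Sex/Survive combination
counts <- table(Sex = sex, Survive = survive)
print(counts)
```

A combination with a count of zero appears in the table but gets no point in the plot.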

6.2 Two discrete: geom_jitter()

ggplot(data = ICU, 
       aes(x = Sex, y = Survive)) + 
       geom_jitter()

6.3 Two discrete: geom_jitter()

ggplot(data = ICU, 
       aes(x = Sex, y = Survive)) + 
       geom_jitter(height = 0.25, 
                   width = 0.25)

6.4 Two continuous: geom_point()

ggplot(data = ICU, 
       aes(x = SysBP, y = Pulse)) + 
       geom_point()

6.5 Two continuous: geom_smooth()

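# With no method given, geom_smooth() picks one automatically
# (loess for fewer than 1,000 observations)
ggplot(data = ICU, 
       aes(x = SysBP, y = Pulse)) + 
       geom_point() + 
       geom_smooth()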

6.6 Two continuous: geom_smooth()

ggplot(data = ICU, 
       aes(x = SysBP, y = Pulse)) + 
       geom_point() + 
       geom_smooth(method = "lm", 
                   se = FALSE)
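
The straight line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit of y on x. A sketch of the equivalent model in base R, using only the six rows printed by head(ICU), so the coefficients will differ from those for the full data:

```r
# SysBP and Pulse from the six rows printed by head(ICU)
icu6 <- data.frame(
  SysBP = c(80, 142, 112, 100, 128, 142),
  Pulse = c(96, 88, 80, 70, 90, 103)
)

# Least-squares line: Pulse = intercept + slope * SysBP
fit <- lm(Pulse ~ SysBP, data = icu6)
print(coef(fit))

# The slope can be checked by hand as cov(x, y) / var(x)
slope_by_hand <- cov(icu6$SysBP, icu6$Pulse) / var(icu6$SysBP)
print(slope_by_hand)
```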

6.7 One of each: geom_col()

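# geom_col() stacks one segment per row, so each bar's height is
# the sum of Pulse within that Sex group, not the mean
ggplot(data = ICU, 
       aes(x = Sex, y = Pulse)) + 
       geom_col()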
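
Because geom_col() plots the y values as given and stacks them within each x category, the bar heights here are group totals. A base-R sketch of the difference between totals and means, using only the six rows printed by head(ICU):

```r
# Sex and Pulse from the six rows printed by head(ICU)
sex   <- c(1, 1, 0, 0, 1, 0)   # 1 = female, 0 = male
pulse <- c(96, 88, 80, 70, 90, 103)

# What the stacked bars add up to: the sum of Pulse per group
bar_heights <- tapply(pulse, sex, sum)
print(bar_heights)

# What is usually of interest instead: the mean Pulse per group
group_means <- tapply(pulse, sex, mean)
print(group_means)
```

If group means are the goal, one option is to summarise the data first and pass the summary table to geom_col().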

6.8 One of each: geom_boxplot()

ggplot(data = ICU, 
       aes(x = Sex, y = Pulse)) + 
       geom_boxplot()

6.9 One of each: geom_dotplot()

ggplot(data = ICU, 
       aes(x = Sex, y = Pulse)) + 
       geom_dotplot(method = "histodot",
                    binaxis = "y", 
                    stackdir = "center")

6.10 One of each: geom_violin()

ggplot(data = ICU, 
       aes(x = Sex, y = Pulse)) + 
       geom_violin()

6.11 One of each: violin + dotplot

ggplot(data = ICU, 
       aes(x = Sex, y = Pulse)) + 
       geom_violin() + 
       geom_dotplot(method = "histodot",
                    binaxis = "y", 
                    stackdir = "center")

7 A few more things

7.1 Vertical lines

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 30) + 
       geom_vline(xintercept = mean(ICU$Age, 
                                    na.rm = TRUE), 
                  color = "blue", 
                  linewidth = 1.5) + 
       geom_vline(xintercept = median(ICU$Age, 
                                      na.rm = TRUE), 
                  color = "red", 
                  linewidth = 1.5, 
                  linetype = "dashed")

7.2 Colors

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 30, 
                      color = "black",      # bar outline
                      fill = "royalblue")   # bar interior

  • A good resource on colors in R is here

7.3 Default theme

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 30)

# theme_gray() is the ggplot2 default, so this plot matches the one above
ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 30) +
       theme_gray()

8 In-class activities

8.1 In-class activities

  • Make plots in ggplot2
  • Select among different plots for the same variable

8.2 Next week

  • Colors and opacity
  • Error bars
  • Annotations (reference lines, cut-offs, text)
  • Changing some common things (themes, labels, re-ordering categories)
  • Complex combined plots (rugs, raincloud plots, index plot)