Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Select an appropriate plot for the variable type
  • Create plots in ggplot2 (part of the tidyverse)

2 Plots

2.1 Review: Variable types

R variable type   Stevens (1946) variable type   ggplot2 *
chr               nominal                        discrete
fct               nominal or ordinal             discrete
lgl               nominal or ordinal             discrete
int               interval or ratio              continuous
dbl               interval or ratio              continuous

* From the ggplot2 cheatsheet
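As a quick check of the table above, the sketch below builds a small made-up data frame (the name toy, its columns, and its values are illustrative only, not part of the ICU data) and asks base R how each column is stored. Note that R reports a dbl column as "numeric".

```r
# Hypothetical toy data: one column per R type in the table
toy <- data.frame(
  name    = c("a", "b", "c"),             # chr
  group   = factor(c("lo", "hi", "lo")),  # fct
  treated = c(TRUE, FALSE, TRUE),         # logical
  visits  = c(1L, 4L, 2L),                # int
  weight  = c(61.5, 72.0, 58.3),          # dbl
  stringsAsFactors = FALSE
)

# class() shows how R stores each column
types <- sapply(toy, class)
print(types)

# Discrete vs. continuous in the ggplot2 sense:
# numeric columns are continuous, everything else is discrete
is_continuous <- sapply(toy, is.numeric)
print(is_continuous)
```

Only the int and dbl columns come back as continuous, which matches the last column of the table.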

2.2 Why do we make plots?

  • Exploration

  • Analysis

  • Presentation

  • Sometimes a plot is the best way to convey really complicated information

2.3 Baby data - 5.5 mo vs 10.5 mo

2.4 How we’d like it to work

https://xkcd.com/2400/

2.5 How it really works

https://www.instagram.com/twisteddoodles/

3 Data

3.1 Data

  • ICU data from the Stat2Data package
    • ID: Patient ID code
    • Survive: 1 = patient survived to discharge or 0 = patient died
    • Age: Age (in years)
    • AgeGroup: 1 = young (under 50), 2 = middle (50-69), 3 = old (70+)
    • Sex: 1 = female or 0 = male
    • Infection: 1 = infection suspected or 0 = no infection
    • SysBP: Systolic blood pressure (in mm of Hg)
    • Pulse: Heart rate (beats per minute)
    • Emergency: 1 = emergency admission or 0 = elective admission

3.2 Data

  • Nominal / ordinal / factor / discrete / binary
    • Survive: 1 = patient survived to discharge or 0 = patient died
    • Sex: 1 = female or 0 = male
    • Infection: 1 = infection suspected or 0 = no infection
    • Emergency: 1 = emergency admission or 0 = elective admission
  • Ordinal / factor / discrete
    • AgeGroup: 1 = young (under 50), 2 = middle (50-69), 3 = old (70+)

3.3 Data

  • Ratio / integer / numeric / continuous
    • Age: Age (in years)
    • SysBP: Systolic blood pressure (in mm of Hg)
    • Pulse: Heart rate (beats per minute)

3.4 Data

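library(Stat2Data)  # package that contains the ICU data
data(ICU)           # load the ICU data frame into the workspace
head(ICU)           # print the first six rows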
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
1  4       0  87        3   1         1    80    96         1
2  8       1  27        1   1         1   142    88         1
3 12       1  59        2   0         0   112    80         1
4 14       1  77        3   0         0   100    70         0
5 27       0  76        3   1         1   128    90         1
6 28       1  54        2   0         1   142   103         1

3.5 Change variable types

ICU$Survive <- as.factor(ICU$Survive)
ICU$Sex <- as.factor(ICU$Sex)
ICU$Infection <- as.factor(ICU$Infection)
ICU$Emergency <- as.factor(ICU$Emergency)
ICU$AgeGroup <- as.factor(ICU$AgeGroup)
  • By default, factor levels are ordered numerically (for numbers) or alphabetically (for text)
  • Nothing we’re doing today requires changing that order, but it can matter for more complex figures or when you want the categories in a specific order
    • You can also change the variable type within ggplot()
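The default level order and the in-plot conversion described above can be sketched as follows. To stay self-contained, the example retypes two columns from the six rows printed by head(ICU) into a data frame called icu6 (a name chosen here for illustration), and it assumes ggplot2 is installed.

```r
library(ggplot2)

# Two columns from the six rows shown by head(ICU), retyped by hand
icu6 <- data.frame(
  Survive  = c(0, 1, 1, 1, 0, 1),
  AgeGroup = c(3, 1, 2, 3, 3, 2)
)

# Default level order: sorted, not order of appearance
levels(as.factor(icu6$AgeGroup))

# Set the order (and readable labels) explicitly when it matters
age_group <- factor(icu6$AgeGroup,
                    levels = c(3, 2, 1),
                    labels = c("old", "middle", "young"))
levels(age_group)

# Convert inside ggplot() instead of changing the data frame:
# Survive stays numeric in icu6 but is treated as discrete in the plot
p <- ggplot(data = icu6,
            aes(x = factor(Survive))) +
       geom_bar()
p
```

Converting inside aes() is handy for a one-off plot; converting the column once, as on this slide, is less repetitive when many plots use the same variable.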

4 ggplot2 package

4.1 ggplot2 package

  • Wilkinson, L. (2005). The Grammar of Graphics (2nd ed.). Statistics and Computing, New York: Springer.
  • Wickham, H. (2010). A layered grammar of graphics. Journal of Computational and Graphical Statistics, 19(1), 3-28.
  • Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer.
  • Online documentation

4.2 Grammar of graphics

  • Grammar gives language rules. – Leland Wilkinson
  • The grammar tells us what words make up our graphical “sentences,” but offers no advice on how to write well. – Hadley Wickham
    • “Colorless green ideas sleep furiously” (Noam Chomsky’s example of a sentence that is grammatical but meaningless)

4.3 Grammar of graphics

  • Data
  • Variables
  • Geometry
  • Aesthetics
  • Algebra
  • Scales
  • Statistics
  • Coordinates
  • Facets

4.4 ggplot() structure

ggplot(data = DATA,
       aes(x = XVAR, y = YVAR)) +
       geom_MYGEOM() +
       geom_ANOTHERGEOM() +
       SOME_OTHER_THING +
       ANOTHER_THING
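One way to fill in the template's placeholders is sketched below. The choice of geoms, labels, and theme is only an illustration, and the data frame icu6 (a name invented here) holds the SysBP and Pulse values from the six rows printed by head(ICU) so the example runs on its own.

```r
library(ggplot2)

# SysBP and Pulse from the six rows shown by head(ICU)
icu6 <- data.frame(
  SysBP = c(80, 142, 112, 100, 128, 142),
  Pulse = c(96, 88, 80, 70, 90, 103)
)

p <- ggplot(data = icu6,                                  # DATA
            aes(x = SysBP, y = Pulse)) +                  # XVAR, YVAR
       geom_point() +                                     # geom_MYGEOM()
       geom_rug() +                                       # geom_ANOTHERGEOM()
       labs(x = "Systolic blood pressure (mm Hg)",
            y = "Heart rate (beats per minute)") +        # SOME_OTHER_THING
       theme_bw()                                         # ANOTHER_THING
p
```

Each + adds one piece to the plot, so pieces can be added or removed one at a time while exploring.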

5 Plots for a single variable

5.1 Different kinds of plots

https://www.instagram.com/twisteddoodles/

5.2 One discrete: geom_bar()

library(ggplot2)  # load ggplot2 once per session

ggplot(data = ICU, 
       aes(x = Survive)) + 
       geom_bar()

5.3 One discrete: geom_bar()

ggplot(data = ICU, 
       aes(x = AgeGroup)) + 
       geom_bar()

5.4 One continuous: geom_bar()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_bar()

5.5 One continuous: geom_histogram()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 30)

5.6 One continuous: geom_histogram()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 60)

5.7 One continuous: geom_histogram()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(binwidth = 2)
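The bins argument fixes how many bars are drawn, while binwidth fixes how wide each bar is (here, in years of age). One way to see what a histogram is doing is to look at the values ggplot2 computes for the layer with layer_data(). A minimal sketch, using only the six ages printed by head(ICU) in a data frame named icu6 for illustration:

```r
library(ggplot2)

# Ages from the six rows shown by head(ICU)
icu6 <- data.frame(Age = c(87, 27, 59, 77, 76, 54))

p_width <- ggplot(data = icu6,
                  aes(x = Age)) +
             geom_histogram(binwidth = 2)

# One row per bar: xmin/xmax are the bar edges, count is the bar height
bars <- layer_data(p_width)
bars[bars$count > 0, c("xmin", "xmax", "count")]

# Every bar spans 2 years, and the counts add up to the sample size
range(bars$xmax - bars$xmin)
sum(bars$count)
```

Trying a few values of bins or binwidth, as in the slides above, is usually worthwhile because the apparent shape of the distribution can change with the choice.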

5.8 One continuous: geom_dotplot()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_dotplot(binwidth = 1)

5.9 One continuous: geom_dotplot()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_dotplot(binwidth = 2)

5.10 One continuous: geom_dotplot()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_dotplot(binwidth = 2, 
                    method = "histodot")

5.11 One continuous: geom_dotplot()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_dotplot(binwidth = 2, 
                    method = "histodot", 
                    stackdir = "center")

5.12 One continuous: geom_density()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_density()

5.13 One continuous: geom_violin()

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_violin(aes(y = 0))

5.14 One continuous: violin + dotplot

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_violin(aes(y = 0)) + 
       geom_dotplot(method = "histodot", 
                    stackdir = "center", 
                    binwidth = 2)

Note

Geoms are drawn in the order they are listed, so later layers sit on top of earlier ones. geom_violin() has an opaque white fill by default, so if you listed geom_dotplot() first and geom_violin() second, the violin would cover the dots almost completely.
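The effect of layer order can be checked by building the plot both ways. This is a sketch on the six ages printed by head(ICU) (stored in icu6, a name used here for illustration); it uses layer_data() to confirm which layer is the violin, relying on violinwidth, a variable that ggplot2 computes only for violins.

```r
library(ggplot2)

# Ages from the six rows shown by head(ICU)
icu6 <- data.frame(Age = c(87, 27, 59, 77, 76, 54))

# Violin first, dots second: the dots are drawn on top and stay visible
violin_first <- ggplot(data = icu6, aes(x = Age)) +
  geom_violin(aes(y = 0)) +
  geom_dotplot(method = "histodot", stackdir = "center", binwidth = 2)

# Dots first, violin second: the white violin is drawn over the dots
dots_first <- ggplot(data = icu6, aes(x = Age)) +
  geom_dotplot(method = "histodot", stackdir = "center", binwidth = 2) +
  geom_violin(aes(y = 0))

# Layer 1 is drawn first (bottom); the last layer is drawn on top
violin_is_bottom <- "violinwidth" %in% names(layer_data(violin_first, 1))
violin_is_top    <- "violinwidth" %in% names(layer_data(dots_first, 2))
c(violin_is_bottom, violin_is_top)

violin_first
dots_first
```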

5.15 One continuous: stat_qq

ggplot(data = ICU,
       aes(sample = Age)) + 
       stat_qq() +
       stat_qq_line()

6 Plots for two variables

6.1 Two discrete: geom_count()

ggplot(data = ICU, 
       aes(x = Sex, y = Survive)) + 
       geom_count()
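geom_count() draws one point per combination of the two variables and sizes it by the number of observations there, so it is a picture of the same counts that table() reports. A small sketch, assuming only the Sex and Survive values from the six rows printed by head(ICU) (held in icu6, a name chosen for this example):

```r
library(ggplot2)

# Sex and Survive from the six rows shown by head(ICU), as factors
icu6 <- data.frame(
  Survive = factor(c(0, 1, 1, 1, 0, 1)),
  Sex     = factor(c(1, 1, 0, 0, 1, 0))
)

# Cross-tabulation: rows are Sex, columns are Survive
tab <- table(Sex = icu6$Sex, Survive = icu6$Survive)
tab

p <- ggplot(data = icu6,
            aes(x = Sex, y = Survive)) +
       geom_count()

# n is the count behind each point; empty combinations get no point
counts <- layer_data(p)[, c("x", "y", "n")]
counts
```

In layer_data(), x and y are the positions of the factor levels on the axes (1 for the first level, 2 for the second), not the original codes.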

6.2 Two discrete: geom_jitter()

ggplot(data = ICU, 
       aes(x = Sex, y = Survive)) + 
       geom_jitter()

6.3 Two discrete: geom_jitter()

ggplot(data = ICU, 
       aes(x = Sex, y = Survive)) + 
       geom_jitter(height = 0.25, 
                   width = 0.25)

6.4 Two continuous: geom_point()

ggplot(data = ICU, 
       aes(x = SysBP, y = Pulse)) + 
       geom_point()

6.5 Two continuous: geom_smooth()

ggplot(data = ICU, 
       aes(x = SysBP, y = Pulse)) + 
       geom_point() + 
       geom_smooth()

6.6 Two continuous: geom_smooth()

ggplot(data = ICU, 
       aes(x = SysBP, y = Pulse)) + 
       geom_point() + 
       geom_smooth(method = "lm", 
                   se = FALSE)
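With no method given, geom_smooth() picks a flexible smoother (loess when there are fewer than 1,000 observations); method = "lm" asks for a straight least-squares line instead. The sketch below checks that this line is the same one lm() fits. It uses the SysBP and Pulse values from the six rows printed by head(ICU), in a data frame named icu6 for illustration, and writes the formula explicitly, which is optional.

```r
library(ggplot2)

# SysBP and Pulse from the six rows shown by head(ICU)
icu6 <- data.frame(
  SysBP = c(80, 142, 112, 100, 128, 142),
  Pulse = c(96, 88, 80, 70, 90, 103)
)

p <- ggplot(data = icu6,
            aes(x = SysBP, y = Pulse)) +
       geom_point() +
       geom_smooth(method = "lm",
                   formula = y ~ x,  # stated explicitly to avoid a message
                   se = FALSE)

# The same line, fitted directly
fit <- lm(Pulse ~ SysBP, data = icu6)
coef(fit)

# Layer 2 holds the points ggplot2 uses to draw the line;
# they should agree with predictions from lm()
line_pts <- layer_data(p, 2)
same_line <- all.equal(
  line_pts$y,
  unname(predict(fit, newdata = data.frame(SysBP = line_pts$x)))
)
same_line
```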

6.7 One of each: geom_col()

ggplot(data = ICU, 
       aes(x = Sex, y = Pulse)) + 
       geom_col()
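A point worth checking with geom_col() on raw data: it draws one segment per row and stacks them, so each bar's height is the sum of Pulse in that group, not the mean. The sketch below shows this on the six rows printed by head(ICU) (in icu6, a name chosen here) and then summarizes first to get bars of means; using aggregate() for the summary is just one option.

```r
library(ggplot2)

# Sex and Pulse from the six rows shown by head(ICU)
icu6 <- data.frame(
  Sex   = factor(c(1, 1, 0, 0, 1, 0)),
  Pulse = c(96, 88, 80, 70, 90, 103)
)

# geom_col() on raw data: one stacked segment per patient
p_raw <- ggplot(data = icu6,
                aes(x = Sex, y = Pulse)) +
           geom_col()

# The top of each stack equals the group total
stacks     <- layer_data(p_raw)
stack_tops <- tapply(stacks$ymax, as.numeric(stacks$x), max)
group_sums <- tapply(icu6$Pulse, icu6$Sex, sum)
stack_tops
group_sums

# Summarize first if you want one bar per group showing the mean
pulse_means <- aggregate(Pulse ~ Sex, data = icu6, FUN = mean)
p_mean <- ggplot(data = pulse_means,
                 aes(x = Sex, y = Pulse)) +
            geom_col()
p_mean
```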

6.8 One of each: geom_boxplot()

ggplot(data = ICU, 
       aes(x = Sex, y = Pulse)) + 
       geom_boxplot()

6.9 One of each: geom_dotplot()

ggplot(data = ICU, 
       aes(x = Sex, y = Pulse)) + 
       geom_dotplot(method = "histodot",
                    binaxis = "y", 
                    stackdir = "center")

6.10 One of each: geom_violin()

ggplot(data = ICU, 
       aes(x = Sex, y = Pulse)) + 
       geom_violin()

6.11 One of each: violin + dotplot

ggplot(data = ICU, 
       aes(x = Sex, y = Pulse)) + 
       geom_violin() + 
       geom_dotplot(method = "histodot",
                    binaxis = "y", 
                    stackdir = "center")

7 A few more things

7.1 Vertical lines

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 30) + 
       geom_vline(xintercept = mean(ICU$Age, 
                                    na.rm = TRUE), 
                  color = "blue", 
                  linewidth = 1.5) + 
       geom_vline(xintercept = median(ICU$Age, 
                                      na.rm = TRUE), 
                  color = "red", 
                  linewidth = 1.5, 
                  linetype = "dashed")

7.2 Colors

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 30, 
                      color = "black", 
                      fill = "royalblue")

  • A good resource on colors in R is here

7.3 Default theme

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 30)

ggplot(data = ICU, 
       aes(x = Age)) + 
       geom_histogram(bins = 30) +
       theme_gray()

8 In-class activities

8.1 In-class activities

  • Make plots in ggplot2
  • Select among different plots for the same variable

8.2 Next week

  • Colors and opacity
  • Error bars
  • Annotations (reference lines, cut-offs, text)
  • Changing some common things (themes, labels, re-ordering categories)
  • Complex combined plots (rugs, raincloud plots, index plot)