Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Run models with interaction effects to assess conditional effects of predictors
  • Probe interaction models to fully understand all effects

2 Additive models

2.1 Our models so far

  • Additive effects
    • The effect of two variables is the sum of two individual effects, nothing more
    • Effect of predictor 1: Add that effect
    • Effect of predictor 2: Add that effect
    • Effect of both predictors: Add both effects

2.2 Example data and analysis

library(Stat2Data)
data(ICU)
m1 <- lm(data = ICU, 
         Pulse ~ Emergency + Age)
library(broom)
ICU_aug <- augment(m1)

2.3 Additive effects

summary(m1)

Call:
lm(formula = Pulse ~ Emergency + Age, data = ICU)

Residuals:
   Min     1Q Median     3Q    Max 
-63.10 -18.13  -3.56  17.27  88.73 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 84.91261    7.08132  11.991  < 2e-16 ***
Emergency   11.45223    4.31849   2.652  0.00866 ** 
Age          0.09723    0.09527   1.021  0.30873    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.48 on 197 degrees of freedom
Multiple R-squared:  0.03582,   Adjusted R-squared:  0.02603 
F-statistic: 3.659 on 2 and 197 DF,  p-value: 0.02753

2.4 Constant, additive effects

\[\hat{Pulse} = 84.913 + 11.452 Emergency + 0.097 Age\]

  • Effect of Emergency is the same regardless of value of Age
  • Effect of Age is the same regardless of value of Emergency
    • Same slope wrt Age regardless of value of Emergency
    • Emergency = 0: \(\hat{Pulse} = 84.913 + 0.097 Age\)
    • Emergency = 1: \(\hat{Pulse} = 96.365 + 0.097 Age\)

2.5 Additive effects = parallel lines

Code
ggplot(data = ICU_aug,
       aes(x = Age, y = Pulse, 
           group = Emergency)) +
  geom_point(aes(color = as.factor(Emergency)), 
             alpha = 0.4, size = 2) +
  geom_line(aes(y = .fitted, color = as.factor(Emergency)), 
            linewidth = 1) +
  labs(color = "Emergency")

2.6 Assumptions: Residuals

Code
ggplot(data = ICU_aug,
       aes(x = .fitted, 
           y = .std.resid)) +
  geom_point() +
  geom_smooth()

2.7 Assumptions: Residuals

Code
ggplot(data = ICU_aug,
       aes(x = .fitted, 
           y = .std.resid, 
           color = as.factor(Emergency))) +
  geom_point() +
  geom_smooth() +
  labs(color = "Emergency")

2.8 More going on?

Code
ggplot(data = ICU,
       aes(x = Age, 
           y = Pulse, 
           group = Emergency)) +
  geom_point(aes(color = as.factor(Emergency)), 
             alpha = 0.4, 
             size = 2) +
  geom_smooth(se = FALSE,
              aes(color = as.factor(Emergency))) +
  labs(color = "Emergency")

2.9 More going on?

Code
ggplot(data = ICU,
       aes(x = Age, 
           y = Pulse, 
           group = Emergency)) +
  geom_point(aes(color = as.factor(Emergency)), 
             alpha = 0.4, 
             size = 2) +
  geom_smooth(method = "lm", 
              se = FALSE,
              aes(color = as.factor(Emergency))) +
  labs(color = "Emergency")

3 Interactions (moderation)

3.1 Interaction is a product term

\[\hat{Pulse} = b_0 + b_1 Emergency + b_2 Age + b_3 Emergency\times Age\]

  • Emergency and Age have their own effects
  • But they do something special together
  • The effect of one predictor is conditional on (or depends on) the value of the other predictor
  • Non-parallel lines

3.2 Interaction effect

m2 <- lm(data = ICU, Pulse ~ Emergency + Age + Emergency*Age)
summary(m2)

Call:
lm(formula = Pulse ~ Emergency + Age + Emergency * Age, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.557 -16.917  -3.928  16.011  86.799 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   125.6093    15.2350   8.245 2.35e-14 ***
Emergency     -36.0513    16.3861  -2.200  0.02897 *  
Age            -0.5409     0.2323  -2.329  0.02088 *  
Emergency:Age   0.7612     0.2537   3.001  0.00304 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.96 on 196 degrees of freedom
Multiple R-squared:  0.07817,   Adjusted R-squared:  0.06406 
F-statistic:  5.54 on 3 and 196 DF,  p-value: 0.001137

3.3 Interaction effect

\[\hat{Pulse} = 125.609 + -36.051 Emergency +\] \[-0.541 Age + 0.761 Emergency\times Age\]

  • Simplify for the two Emergency groups
    • Substitute 0 and 1 for Emergency

3.4 Interaction effect

  • Emergency = 0
    • \(\hat{Pulse} = 125.609 + -36.051 (0) + -0.541 Age + 0.761 (0)\times Age\)
    • \(\hat{Pulse} = 125.609 + -0.541 Age\)
  • Emergency = 1
    • \(\hat{Pulse} = 125.609 + -36.051 (1) + -0.541 Age + 0.761 (1)\times Age\)
    • \(\hat{Pulse} = (125.609 -36.051) + (-0.541 + 0.761) Age\)
    • \(\hat{Pulse} = 89.558 + 0.22 Age\)

3.5 What do all the coefficients mean?

\[\hat{Pulse} = 125.609 + -36.051 Emergency +\] \[-0.541 Age + 0.761 Emergency\times Age\]

  • Intercept: 125.609
    • Expected Pulse when all predictors are equal to 0
    • Emergency = 0 and Age = 0
    • Pulse for non-emergency admit newborn

3.6 What do all the coefficients mean?

\[\hat{Pulse} = 125.609 + -36.051 Emergency +\] \[-0.541 Age + 0.761 Emergency\times Age\]

  • Emergency effect = -36.051
    • Difference in Pulse between emergency and non-emergency admits when Age equals 0
    • In this case, an intercept difference

3.7 What do all the coefficients mean?

\[\hat{Pulse} = 125.609 + -36.051 Emergency +\] \[-0.541 Age + 0.761 Emergency\times Age\]

  • Age effect = -0.541
    • Expected change in Pulse for a 1 unit change in Age when Emergency equals 0
    • Slope wrt Age for Emergency = 0 group

3.8 What do all the coefficients mean?

\[\hat{Pulse} = 125.609 + -36.051 Emergency +\] \[-0.541 Age + 0.761 Emergency\times Age\]

  • Interaction = 0.761
    • Difference in Age slopes between emergency and non admits
    • Significant interaction = slopes are different
    • Not parallel lines

3.9 Plot of interaction

Code
ggplot(data = ICU,
       aes(x = Age, 
           y = Pulse, 
           group = Emergency)) +
  geom_point(aes(color = as.factor(Emergency)), 
             alpha = 0.3, 
             size = 2) +
  geom_abline(slope = -0.541, 
              intercept = 125.609, 
              linewidth = 1,
              color = "#F8766D") +
  geom_abline(slope = 0.22, 
              intercept = 89.558,
              linewidth = 1,
              color = "#00BA38") +
  xlim(0, 90) +
  theme(legend.position="none")

\(\hat{Pulse} = 125.609 + -36.051 Emergency + -0.541 Age + 0.761 Emergency\times Age\)

3.10 Plot of interaction

Code
ggplot(data = ICU,
       aes(x = Age, 
           y = Pulse, 
           group = Emergency)) +
  geom_point(aes(color = as.factor(Emergency)), 
             alpha = 0.3, 
             size = 2) +
  geom_abline(slope = -0.541, 
              intercept = 125.609, 
              linewidth = 1,
              color = "#F8766D") +
  geom_abline(slope = 0.22, 
              intercept = 89.558,
              linewidth = 1,
              color = "#00BA38") +
  xlim(0, 90) +
  theme(legend.position="none") +
  annotate("pointrange", 
           x = 0, y = 125.609, 
           ymin = 125.609, ymax = 125.609, 
           size = 1) +
  annotate("text", x = 0, y = 135, label = "125.609", size = 6)

\(\hat{Pulse} = 125.609 + -36.051 Emergency + -0.541 Age + 0.761 Emergency\times Age\)

3.11 Plot of interaction

Code
ggplot(data = ICU,
       aes(x = Age, 
           y = Pulse, 
           group = Emergency)) +
  geom_point(aes(color = as.factor(Emergency)), 
             alpha = 0.3, 
             size = 2) +
  geom_abline(slope = -0.541, 
              intercept = 125.609, 
              linewidth = 1,
              color = "#F8766D") +
  geom_abline(slope = 0.22, 
              intercept = 89.558,
              linewidth = 1,
              color = "#00BA38") +
  xlim(0, 90) +
  theme(legend.position="none") +
  annotate("pointrange", 
           x = 0, y = 125.609, 
           ymin = 125.609, ymax = 125.609, 
           size = 1) +
  annotate("text", x = 0, y = 135, label = "125.609", size = 6) +
  geom_segment(x = 0, xend = 0, 
               y = 89.558, yend = 125.609,
               linewidth = 1, linetype = "dashed") +
  annotate("text", x = 5, y = 105, label = "-36.051", size = 6)

\(\hat{Pulse} = 125.609 + -36.051 Emergency + -0.541 Age + 0.761 Emergency\times Age\)

3.12 Plot of interaction

Code
ggplot(data = ICU,
       aes(x = Age, 
           y = Pulse, 
           group = Emergency)) +
  geom_point(aes(color = as.factor(Emergency)), 
             alpha = 0.3, 
             size = 2) +
  geom_abline(slope = -0.541, 
              intercept = 125.609, 
              linewidth = 1,
              color = "#F8766D") +
  geom_abline(slope = 0.22, 
              intercept = 89.558,
              linewidth = 1,
              color = "#00BA38") +
  xlim(0, 90) +
  theme(legend.position="none") +
  annotate("pointrange", 
           x = 0, y = 125.609, 
           ymin = 125.609, ymax = 125.609, 
           size = 1) +
  annotate("text", x = 0, y = 135, label = "125.609", size = 6) +
  geom_segment(x = 0, xend = 0, 
               y = 89.558, yend = 125.609,
               linewidth = 1, linetype = "dashed") +
  annotate("text", x = 5, y = 105, label = "-36.051", size = 6) +
  geom_abline(intercept = 125.609, slope = -0.541,
              linetype = "dashed", linewidth = 1) +
  annotate("text", x = 25, y = 125, label = "-0.541", size = 6)

\(\hat{Pulse} = 125.609 + -36.051 Emergency + -0.541 Age + 0.761 Emergency\times Age\)

3.13 Plot of interaction

Code
ggplot(data = ICU,
       aes(x = Age, 
           y = Pulse, 
           group = Emergency)) +
  geom_point(aes(color = as.factor(Emergency)), 
             alpha = 0.3, 
             size = 2) +
  geom_abline(slope = -0.541, 
              intercept = 125.609, 
              linewidth = 1,
              color = "#F8766D") +
  geom_abline(slope = 0.22, 
              intercept = 89.558,
              linewidth = 1,
              color = "#00BA38") +
  xlim(0, 90) +
  theme(legend.position="none") +
  annotate("pointrange", 
           x = 0, y = 125.609, 
           ymin = 125.609, ymax = 125.609, 
           size = 1) +
  annotate("text", x = 0, y = 135, label = "125.609", size = 6) +
  geom_segment(x = 0, xend = 0, 
               y = 89.558, yend = 125.609,
               linewidth = 1, linetype = "dashed") +
  annotate("text", x = 5, y = 105, label = "-36.051", size = 6) +
  geom_abline(intercept = 125.609, slope = -0.541,
              linetype = "dashed", linewidth = 1) +
  annotate("text", x = 25, y = 125, label = "-0.541", size = 6) +
  geom_abline(intercept = 89.558, slope = 0.22,
              linetype = "dashed", linewidth = 1) +
  geom_curve(x = 80, xend = 80, 
             y = 82.329, yend = 107.158,
             linewidth = 1, linetype = "dashed",
             arrow = arrow(length = unit(0.3, "cm"), 
                           type = "closed")) +
  annotate("text", x = 87, y = 95, label = "0.761", size = 6)

\(\hat{Pulse} = 125.609 + -36.051 Emergency + -0.541 Age + 0.761 Emergency\times Age\)

3.14 Center continuous predictors

  • Recommended to mean center continuous predictors
    • Interpretable intercept
    • Interpretable lower order effects
    • Reduces collinearity between interaction and other predictors
  • Use centered predictor in all places in the model

3.15 Center continuous predictors

ICU <- ICU %>% mutate(AgeC = Age - mean(Age, na.rm = TRUE))
m3 <- lm(data = ICU, Pulse ~ Emergency + AgeC + Emergency*AgeC)
summary(m3)

Call:
lm(formula = Pulse ~ Emergency + AgeC + Emergency * AgeC, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.557 -16.917  -3.928  16.011  86.799 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     94.4823     3.8476  24.556  < 2e-16 ***
Emergency        7.7539     4.4091   1.759  0.08020 .  
AgeC            -0.5409     0.2323  -2.329  0.02088 *  
Emergency:AgeC   0.7612     0.2537   3.001  0.00304 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.96 on 196 degrees of freedom
Multiple R-squared:  0.07817,   Adjusted R-squared:  0.06406 
F-statistic:  5.54 on 3 and 196 DF,  p-value: 0.001137

3.16 What changes?

  • Lower order terms because “…when Age = 0”
    • Intercept: 125.609 \(\rightarrow\) 94.482
      • Now at mean of Age instead of Age = 0
    • Effect of Emergency: -36.051 \(\rightarrow\) 7.754
      • Now at mean of Age instead of Age = 0
  • Collinearity
    • Standard errors are reduced with centering for some terms

3.17 Collinearity before and after centering

library(car)
vif(m2)
    Emergency           Age Emergency:Age 
    15.525294      6.408342     17.562327 
vif(m3)
     Emergency           AgeC Emergency:AgeC 
      1.124039       6.408342       6.203115 

3.18 What doesn’t change?

  • Higher order terms
    • Interaction: 0.761
    • Effect of Age (slope): -0.541
  • \(R^2_{multiple}\) and associated tests
  • Log-likelihood

3.19 Uncentered Age

Code
ggplot(data = ICU,
       aes(x = Age, 
           y = Pulse, 
           group = Emergency)) +
  geom_point(aes(color = as.factor(Emergency)), 
             alpha = 0.4, 
             size = 2) +
  geom_smooth(method = "lm", 
              se = FALSE,
              aes(color = as.factor(Emergency))) +
  labs(color = "Emergency")

3.20 Centered Age

Code
ggplot(data = ICU,
       aes(x = AgeC, 
           y = Pulse, 
           group = Emergency)) +
  geom_point(aes(color = as.factor(Emergency)), 
             alpha = 0.4, 
             size = 2) +
  geom_smooth(method = "lm", 
              se = FALSE,
              aes(color = as.factor(Emergency))) +
  labs(color = "Emergency")

3.21 Probing the interaction

  • Simple slopes: Slopes at specific values
    • Emergency = 0: \(\hat{Pulse} = 125.609 + -0.541 Age\)
    • Emergency = 1: \(\hat{Pulse} = 89.558 + 0.22 Age\)
  • We know (from the original analysis)
    • Significance of coefficients for “group = 0”
    • Significance of differences between the groups
  • What about “group = 1”?

3.22 Two approaches

  • Method 1: Recode your group variable from 0,1 to 1,0
    • Now the group that was 1 is now 0
    • Interpret the intercept and slope for that group
  • Method 2: Calculate things
    • Uses the variances and covariance of coefficients (vcov())
    • See Aiken & West (1991), Preacher, Curran, Bauer (2006)

3.23 Packages to make this easier for you

3.24 Plot simple slopes

library(interactions)
interact_plot(m3, 
              pred = AgeC, 
              modx = Emergency)

3.25 Plot simple slopes with data

interact_plot(m3, 
              pred = AgeC, 
              modx = Emergency, 
              plot.points = TRUE)

3.26 Simple slopes analysis

sim_slopes(m3, 
           pred = AgeC, 
           modx = Emergency, 
           johnson_neyman = FALSE)
SIMPLE SLOPES ANALYSIS

Slope of AgeC when Emergency = 0.00 (0): 

   Est.   S.E.   t val.      p
------- ------ -------- ------
  -0.54   0.23    -2.33   0.02

Slope of AgeC when Emergency = 1.00 (1): 

  Est.   S.E.   t val.      p
------ ------ -------- ------
  0.22   0.10     2.16   0.03

3.27 Johnson-Neyman intervals

  • For which values of AgeC are the Emergency groups different?
    • Not about the slopes but about the values
  • Johnson, P. O., & Fay, L. C. (1950). The Johnson-Neyman technique, its theory and application. Psychometrika, 15, 349–367.

3.28 Johnson-Neyman intervals

sim_slopes(m3, 
           pred = AgeC, 
           modx = Emergency, 
           johnson_neyman = TRUE)
JOHNSON-NEYMAN INTERVAL

When Emergency is OUTSIDE the interval [0.26, 0.97], the slope of AgeC is p
< .05.

Note: The range of observed values of Emergency is [0.00, 1.00]

SIMPLE SLOPES ANALYSIS

Slope of AgeC when Emergency = 0.00 (0): 

   Est.   S.E.   t val.      p
------- ------ -------- ------
  -0.54   0.23    -2.33   0.02

Slope of AgeC when Emergency = 1.00 (1): 

  Est.   S.E.   t val.      p
------ ------ -------- ------
  0.22   0.10     2.16   0.03

3.29 Johnson-Neyman intervals: Improved

sim_slopes(m3, 
           pred = Emergency, 
           modx = AgeC, 
           johnson_neyman = TRUE)
JOHNSON-NEYMAN INTERVAL

When AgeC is OUTSIDE the interval [-44.31, 1.06], the slope of Emergency is
p < .05.

Note: The range of observed values of AgeC is [-41.55, 34.45]

SIMPLE SLOPES ANALYSIS

Slope of Emergency when AgeC = -2.005465e+01 (- 1 SD): 

   Est.   S.E.   t val.      p
------- ------ -------- ------
  -7.51   7.61    -0.99   0.32

Slope of Emergency when AgeC = -1.705303e-15 (Mean): 

  Est.   S.E.   t val.      p
------ ------ -------- ------
  7.75   4.41     1.76   0.08

Slope of Emergency when AgeC =  2.005465e+01 (+ 1 SD): 

   Est.   S.E.   t val.      p
------- ------ -------- ------
  23.02   5.73     4.02   0.00

3.30 Johnson-Neyman intervals: Plot

johnson_neyman(m3, 
               pred = Emergency, 
               modx = AgeC, 
               alpha = .05)

JOHNSON-NEYMAN INTERVAL

When AgeC is OUTSIDE the interval [-44.31, 1.06], the slope of Emergency is
p < .05.

Note: The range of observed values of AgeC is [-41.55, 34.45]

4 Miscellaneous

4.1 Main effects vs simple effects

  • Factorial ANOVA: Main effects + interaction
    • Main effect: Effect of a predictor averaged across the levels of the other predictors
    • Overall effect
  • Linear regression: Simple (main) effects + interaction
    • Simple effect: Effect of a predictor at a specific value of the other predictors
    • Conditional effect

4.2 Main effects vs simple effects

  • In general: Don’t use the term “main effects” for regression
    • “Simple slopes” for the actual simple slopes
    • “Lower order effects” for the coefficients (i.e., \(b_1\) and \(b_2\))
  • If interaction is significant
    • Lower order effects are conditional, so make sure to talk about them accordingly

4.3 Two continuous predictors

ICU <- ICU %>% mutate(SysBPC = SysBP - mean(SysBP, na.rm = TRUE))
m4 <- lm(data = ICU, Pulse ~ AgeC + SysBPC + AgeC*SysBPC)
summary(m4)

Call:
lm(formula = Pulse ~ AgeC + SysBPC + AgeC * SysBPC, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.441 -18.022  -3.145  19.010  93.346 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 98.977148   1.908781  51.854   <2e-16 ***
AgeC         0.049134   0.095734   0.513    0.608    
SysBPC      -0.037546   0.061418  -0.611    0.542    
AgeC:SysBPC -0.001862   0.003779  -0.493    0.623    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.95 on 196 degrees of freedom
Multiple R-squared:  0.006018,  Adjusted R-squared:  -0.009196 
F-statistic: 0.3955 on 3 and 196 DF,  p-value: 0.7563

4.4 Simple slopes

interact_plot(m4, 
              pred = AgeC, 
              modx = SysBPC, 
              plot.points = TRUE)

5 Next time

5.1 Non-normal outcomes

  • Logistic regression for binary outcomes
  • Poisson regressoin for count outcomes
  • Survival analysis for survival outcomes