Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Add predictors to a linear regression model
  • Interpret partial regression coefficients
  • Re-code predictors to meet assumptions
  • Compare different models using \(R^2_{change}\)

2 Adding predictors

2.1 Review: Linear regression

  • Linear regression
    • One continuous outcome variable
    • One or more predictor variables

2.2 Bivariate regression

  • One predictor, one outcome
    • Slope of line relating predictor and outcome
      • Expected change in \(Y\) for 1 unit change in \(X\)
    • If predictor and outcome are standardized
      • Regression coefficient = correlation between \(X\) and \(Y\)
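This equivalence is easy to verify; a minimal sketch with simulated data (variable names are illustrative):

```r
# Simulate a correlated predictor and outcome
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Standardize both variables (mean 0, SD 1), then regress
xs <- as.numeric(scale(x))
ys <- as.numeric(scale(y))
b_std <- unname(coef(lm(ys ~ xs))[2])

# The standardized slope equals the Pearson correlation
all.equal(b_std, cor(x, y))  # TRUE
```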

2.3 Multiple regression

  • With multiple predictors
    • Correlations between predictors affect regression coefficients
    • Not just how \(X\) relates to \(Y\)
      • How does \(X\) relate to other predictors?
  • Interpretation changes
    • Subtle but important

2.4 Correlations and partial correlations

  • “Correlation” (bivariate correlation)
    • Ignore the effect of other predictors
  • Partial correlation
    • “Control for” or “partial out” the effect of other predictors
    • Other variables can be related to \(X\), \(Y\), or both
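One way to see the difference: a partial correlation is the correlation between the residuals of \(X_1\) and \(Y\) after each has been regressed on \(X_2\). A sketch with simulated data (names illustrative), where \(X_2\) drives both of the other variables:

```r
# Simulate X2 driving both X1 and Y
set.seed(2)
x2 <- rnorm(200)
x1 <- 0.6 * x2 + rnorm(200)
y  <- 0.6 * x2 + rnorm(200)

# Bivariate correlation between X1 and Y (ignores X2)
r_biv <- cor(x1, y)

# Partial correlation: correlate the parts of X1 and Y not explained by X2
r_part <- cor(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)))

r_biv   # noticeably positive, driven by the shared X2
r_part  # near zero once X2 is partialled out
```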

2.5 Three variables

Three variables that are all related to each other

2.6 Bivariate correlation

Correlation between \(X_1\) and \(Y\), ignoring \(X_2\)

2.7 Partial correlation

Correlation between \(X_1\) and \(Y\), removing the effect of \(X_2\)

2.8 Uncorrelated predictors

Correlation between \(X_1\) and \(Y\), no effect of \(X_2\)

2.9 Example data

  • ICU data from the Stat2Data package
library(Stat2Data)
data(ICU)
head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
1  4       0  87        3   1         1    80    96         1
2  8       1  27        1   1         1   142    88         1
3 12       1  59        2   0         0   112    80         1
4 14       1  77        3   0         0   100    70         0
5 27       0  76        3   1         1   128    90         1
6 28       1  54        2   0         1   142   103         1

2.10 Correlations between variables

ICU_sub <- ICU[, c(3, 8, 9)]
cor(ICU_sub)
                  Age      Pulse  Emergency
Age        1.00000000 0.03736843 -0.1869571
Pulse      0.03736843 1.00000000  0.1752685
Emergency -0.18695714 0.17526851  1.0000000

2.11 Three models

  • Pulse ~ Age
pulse_vs_age <- lm(data = ICU, 
                   Pulse ~ Age)
  • Pulse ~ Emergency
pulse_vs_ER <- lm(data = ICU, 
                  Pulse ~ Emergency)
  • Pulse ~ Age + Emergency
pulse_vs_age_ER <- lm(data = ICU, 
                      Pulse ~ Age + Emergency)

2.12 Compare models

library(modelsummary)
modelsummary(list(pulse_vs_age, pulse_vs_ER, pulse_vs_age_ER))
                 (1)        (2)        (3)
(Intercept)    96.048     91.113     84.913
              (5.788)    (3.637)    (7.081)
Age             0.050                 0.097
               (0.095)               (0.095)
Emergency                 10.628     11.452
                          (4.243)    (4.318)
Num.Obs.      200        200        200
R2              0.001      0.031      0.036
R2 Adj.        -0.004      0.026      0.026
AIC          1888.1     1882.1     1883.1
BIC          1898.0     1892.0     1896.3
Log.Lik.     -941.048   -938.068   -937.540
F               0.277      6.275      3.659
RMSE           26.74      26.35      26.28

2.13 Just the two-predictor model

summary(pulse_vs_age_ER)

Call:
lm(formula = Pulse ~ Age + Emergency, data = ICU)

Residuals:
   Min     1Q Median     3Q    Max 
-63.10 -18.13  -3.56  17.27  88.73 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 84.91261    7.08132  11.991  < 2e-16 ***
Age          0.09723    0.09527   1.021  0.30873    
Emergency   11.45223    4.31849   2.652  0.00866 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.48 on 197 degrees of freedom
Multiple R-squared:  0.03582,   Adjusted R-squared:  0.02603 
F-statistic: 3.659 on 2 and 197 DF,  p-value: 0.02753

2.14 Interpretation

  • \(\hat{Pulse} = 84.913 + 0.097 * Age + 11.452 * Emergency\)

  • The intercept (\(b_0\)) was 84.913, \(t(197)\) = 11.991, \(p\) < .0001. An individual who was 0 years old and had a non-emergency admission is expected to have a pulse rate of 84.913.

    • \(\hat{Pulse} = 84.913 + 0.097 * 0 + 11.452 * 0 = 84.913\)

2.15 Interpretation

  • The slope for age (\(b_1\)) was 0.097, \(t(197)\) = 1.021, \(p\) > .05. For each 1 year increase in age, the expected pulse rate increased by 0.097 points, holding admission type constant; this effect was not significant.

  • The slope for admission type (\(b_2\)) was 11.452, \(t(197)\) = 2.652, \(p\) < .01. The difference in expected pulse rate between non-emergency admissions (0) and emergency admissions (1) was 11.452 points, holding age constant.

2.16 Interpretation

  • The \(R^2\) for this model was 0.036, F(2, 197) = 3.659, p < .05. About 3.6% of the variance in pulse rate was explained by age and admission type.
    • \(F\)-test for \(R^2_{multiple}\) has two different degrees of freedom
    • Numerator df: # of predictors = 2
    • Denominator df: \(N\) - # of predictors - 1 = 200 - 2 - 1 = 197
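The reported p-value can be reproduced from the F statistic and these two degrees of freedom:

```r
# Upper-tail probability of an F(2, 197) distribution at the observed statistic
p_val <- pf(3.659, df1 = 2, df2 = 197, lower.tail = FALSE)
p_val  # approximately 0.0275, matching the model summary
```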

2.17 Two regression lines

  • Simplify regression equation for each Emergency group
    • Emergency = 0
      • \(\hat{Pulse} = 84.913 + 0.097 * Age + 11.452 * 0\)
      • \(\hat{Pulse} = 84.913 + 0.097 * Age\)
    • Emergency = 1
      • \(\hat{Pulse} = 84.913 + 0.097 * Age + 11.452 * 1\)
      • \(\hat{Pulse} = 96.365 + 0.097 * Age\)

2.18 Two parallel lines

library(ggplot2)
ggplot(data = ICU,
       aes(x = Age, 
           y = Pulse, 
           color = as.factor(Emergency))) +
  geom_point(alpha = 0.4,
             size = 2) +
  geom_abline(intercept = 84.913, 
              slope = 0.097, 
              color = "#F8766D",
              linewidth = 1) +
  geom_abline(intercept = 96.365, 
              slope = 0.097, 
              color = "#00BA38",
              linewidth = 1)

  • Same slope for each line: 0.097
  • Same difference between lines at any Age: 11.452

2.19 Check assumptions

  • With more than 1 predictor
    • Residual vs predictor 1
    • Residual vs predictor 2
    • Residual vs predicted outcome value (combination of predictor 1 and 2)
ICU$pred <- fitted(pulse_vs_age_ER)
ICU$resi <- residuals(pulse_vs_age_ER)

2.20 Residual vs Age

ggplot(data = ICU, 
       aes(x = Age, y = resi)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

2.21 Residual vs Emergency

ggplot(data = ICU, 
       aes(x = Emergency, y = resi)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

2.22 Residual vs predicted Pulse

ggplot(data = ICU, 
       aes(x = pred, y = resi)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

2.23 Distribution of residuals

ggplot(data = ICU, 
       aes(x = resi)) + 
  geom_histogram()

3 Coding predictors

3.1 Binary predictors

  • Simplest coding scheme is 0,1
    • Intercept = expected outcome value for group coded zero
  • Other options
    • 1,2 or -1,+1
    • Intercept is less interpretable

3.2 3 or more categories

  • Dummy codes: Extension of 0,1 scheme to more categories
    • Also called “indicator variables”
  • Note: 2 categories \(\rightarrow\) 1 indicator variable
    • Always use \(a - 1\) variables for \(a\) categories

3.3 3 or more categories

  • AgeGroup in the ICU data frame has 3 categories
    • 1 = young (under 50)
    • 2 = middle (50-69)
    • 3 = old (70+)
  • Create 3 indicator variables (dummy codes)
    • Value of 1 for a category, 0s for all other categories

3.4 3 or more categories

library(dplyr)
ICU <- ICU %>% mutate(d1 = ifelse(AgeGroup == 1, 1, 0),
                      d2 = ifelse(AgeGroup == 2, 1, 0),
                      d3 = ifelse(AgeGroup == 3, 1, 0))

head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency      pred
1  4       0  87        3   1         1    80    96         1 104.82370
2  8       1  27        1   1         1   142    88         1  98.99000
3 12       1  59        2   0         0   112    80         1 102.10131
4 14       1  77        3   0         0   100    70         0  92.39919
5 27       0  76        3   1         1   128    90         1 103.75419
6 28       1  54        2   0         1   142   103         1 101.61517
        resi d1 d2 d3
1  -8.823701  0  0  1
2 -10.990001  1  0  0
3 -22.101308  0  1  0
4 -22.399190  0  0  1
5 -13.754189  0  0  1
6   1.384834  0  1  0

3.5 Analysis

  • Select any two (in general, \(a - 1\)) dummy codes as predictors
    • The one that’s left out is the reference group (all 0s)
    • All other groups are compared to the reference group
  AgeGroup  d1  d2  d3
         1   1   0   0
         2   0   1   0
         3   0   0   1
  • Leave out d3: People in AgeGroup = 3 have 0s on both d1 and d2

3.6 Analysis

pulse_vs_agegroup <- lm(data = ICU, Pulse ~ d1 + d2)
summary(pulse_vs_agegroup)

Call:
lm(formula = Pulse ~ d1 + d2, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.429 -18.377  -3.429  19.781  92.641 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  99.3594     3.3694  29.489   <2e-16 ***
d1           -1.5628     4.8650  -0.321    0.748    
d2            0.0692     4.5595   0.015    0.988    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.96 on 197 degrees of freedom
Multiple R-squared:  0.000745,  Adjusted R-squared:  -0.0094 
F-statistic: 0.07344 on 2 and 197 DF,  p-value: 0.9292

3.7 Interpretation

  • Intercept: Expected Pulse for reference group (AgeGroup = 3)
  • Effect of d1: Expected difference between the reference group (AgeGroup = 3) and the group coded 1 on d1 (AgeGroup = 1)
  • Effect of d2: Expected difference between the reference group (AgeGroup = 3) and the group coded 1 on d2 (AgeGroup = 2)

3.8 Interpretation

  • Pulse for AgeGroup = 3 is the intercept
    • 99.359
  • Pulse for AgeGroup = 1 is 1.563 points lower (negative) than the value for AgeGroup = 3
    • 99.359 - 1.563 = 97.796
  • Pulse for AgeGroup = 2 is 0.069 points higher (positive) than the value for AgeGroup = 3
    • 99.359 + 0.069 = 99.428

3.9 Interpretation: Figure

ggplot(data = ICU,
       aes(x = AgeGroup, y = Pulse)) +
  geom_point() +
  geom_hline(yintercept = 99.359,
             color = "black") + #3
  geom_hline(yintercept = 97.796,
             color = "blue") + #1
  geom_hline(yintercept = 99.428,
             color = "red",
             linetype = "dashed") + #2
  coord_flip()

4 Centering predictors

4.1 What is an intercept?

  • The expected value of the outcome when all predictors are equal to 0
    • (We just saw how this worked for dummy codes)
    • 0 may not be a possible value for some predictors
    • 0 may not be a meaningful value (e.g., an age of 0)
  • How can we make the 0 value more meaningful?

4.2 Centering predictors

  • Move the 0 point to a new value
  • Typically mean centering
    • But you can choose other (meaningful) values
    • Median, percentile, cut-off, etc.
  • Subtract this value (e.g., mean) from all predictor values
  • Mean of original variable = 0 in new variable
    • 0 is now a meaningful value, so the intercept is interpretable

4.3 Example: Mean center Age

mean(ICU$Age, na.rm = TRUE)
[1] 57.545
ICU$AgeC = ICU$Age - mean(ICU$Age, na.rm = TRUE)
head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency      pred
1  4       0  87        3   1         1    80    96         1 104.82370
2  8       1  27        1   1         1   142    88         1  98.99000
3 12       1  59        2   0         0   112    80         1 102.10131
4 14       1  77        3   0         0   100    70         0  92.39919
5 27       0  76        3   1         1   128    90         1 103.75419
6 28       1  54        2   0         1   142   103         1 101.61517
        resi d1 d2 d3    AgeC
1  -8.823701  0  0  1  29.455
2 -10.990001  1  0  0 -30.545
3 -22.101308  0  1  0   1.455
4 -22.399190  0  0  1  19.455
5 -13.754189  0  0  1  18.455
6   1.384834  0  1  0  -3.545

4.4 Model with centered predictor

pulse_vs_ageC <- lm(data = ICU, Pulse ~ AgeC)
summary(pulse_vs_ageC)

Call:
lm(formula = Pulse ~ AgeC, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.998 -18.098  -4.123  19.452  92.402 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 98.92500    1.90060  52.049   <2e-16 ***
AgeC         0.04999    0.09501   0.526    0.599    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.88 on 198 degrees of freedom
Multiple R-squared:  0.001396,  Adjusted R-squared:  -0.003647 
F-statistic: 0.2769 on 1 and 198 DF,  p-value: 0.5993

4.5 Model with centered predictor

ggplot(data = ICU, 
       aes(x = AgeC, y = Pulse)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  annotate("pointrange", x = 0, 
           y = 98.925, 
           ymin = 98.925, 
           ymax = 98.925,
           colour = "red", 
           size = 2)

  • \(b_0\) = 98.925: Predicted Pulse at mean Age

4.6 Other implications

  • Centering predictors doesn’t change the slopes, only intercepts
  • Centering predictors doesn’t change the \(R^2_{multiple}\)
  • Interactions / moderation:
    • Conditional effects: Effect of one predictor depends on other predictors
    • Specifically, conditional on predictors being equal to 0
      • We’ll care where 0 is
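These invariances are easy to check; a sketch comparing raw and mean-centered versions of one predictor (simulated data, names illustrative):

```r
# Simulate a predictor whose mean is far from 0
set.seed(4)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 + 0.3 * x + rnorm(100)

fit_raw      <- lm(y ~ x)
fit_centered <- lm(y ~ I(x - mean(x)))

# Slope and R^2 are unchanged; only the intercept moves
all.equal(unname(coef(fit_raw)[2]), unname(coef(fit_centered)[2]))       # TRUE
all.equal(summary(fit_raw)$r.squared, summary(fit_centered)$r.squared)   # TRUE
coef(fit_centered)[1]  # intercept: predicted y at the mean of x
```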

5 Transforming predictors

5.1 “Linear in the predictors”

  • \(\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots\)
    • Predicted value is a linear combination of regression coefficients and predictors
  • In practice, this looks like a linear relationship between \(X\) and \(Y\)
    • What if the relationship isn’t a straight line?

5.2 Gapminder data

library(gapminder)
data(gapminder)
ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth()

5.3 Bad model

bad_gdp <- lm(data = gapminder, lifeExp ~ gdpPercap)
gapminder$bad_pred <- fitted(bad_gdp)
gapminder$bad_resi <- residuals(bad_gdp)
summary(bad_gdp)

Call:
lm(formula = lifeExp ~ gdpPercap, data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-82.754  -7.758   2.176   8.225  18.426 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.396e+01  3.150e-01  171.29   <2e-16 ***
gdpPercap   7.649e-04  2.579e-05   29.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.49 on 1702 degrees of freedom
Multiple R-squared:  0.3407,    Adjusted R-squared:  0.3403 
F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 2.2e-16

5.4 Bad model: Fitted line

ggplot(data = gapminder, 
       aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm")

5.5 Bad model: Residual vs predictor

ggplot(data = gapminder, 
       aes(x = gdpPercap, y = bad_resi)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

5.6 Bad model: Residual distribution

ggplot(data = gapminder, 
       aes(x = bad_resi)) + 
  geom_histogram()

5.7 Transforming predictors

  • It’s generally OK to transform predictors to restore a linear relationship
    • Log transformation for very skewed predictors (e.g., GDP, income)
    • Add squared term (\(X^2\)) for curvilinear effects
  • How do you know what transformation to use?
    • Literature
    • Plots
  • Note: Do not transform outcomes
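The squared-term option can be sketched with simulated curvilinear data (names illustrative); in an R formula, `I()` protects the arithmetic in `x^2`:

```r
# Simulate a quadratic (inverted-U) relationship
set.seed(5)
x <- runif(200, 0, 10)
y <- 1 + 2 * x - 0.2 * x^2 + rnorm(200)

# Linear-only model vs model with an added squared term
fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))

summary(fit_lin)$r.squared   # misses the curvature
summary(fit_quad)$r.squared  # substantially higher
```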

5.8 Example: Transform predictors

gapminder <- gapminder %>% mutate(loggdpPercap = log(gdpPercap))
ggplot(data = gapminder,
       aes(x = loggdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth()

5.9 Better model

good_gdp <- lm(data = gapminder, lifeExp ~ loggdpPercap)
gapminder$good_pred <- fitted(good_gdp)
gapminder$good_resi <- residuals(good_gdp)
summary(good_gdp)

Call:
lm(formula = lifeExp ~ loggdpPercap, data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.778  -4.204   1.212   4.658  19.285 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -9.1009     1.2277  -7.413 1.93e-13 ***
loggdpPercap   8.4051     0.1488  56.500  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.62 on 1702 degrees of freedom
Multiple R-squared:  0.6522,    Adjusted R-squared:  0.652 
F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16

5.10 Better model: Fitted line

ggplot(data = gapminder, 
       aes(x = loggdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm")

5.11 Better model: Residual vs predictor

ggplot(data = gapminder, 
       aes(x = loggdpPercap, y = good_resi)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

5.12 Better model: Residual distribution

ggplot(data = gapminder, 
       aes(x = good_resi)) + 
  geom_histogram()

6 Comparing models

6.1 Comparing models

modelsummary(list(pulse_vs_age, pulse_vs_ER, pulse_vs_age_ER))
                 (1)        (2)        (3)
(Intercept)    96.048     91.113     84.913
              (5.788)    (3.637)    (7.081)
Age             0.050                 0.097
               (0.095)               (0.095)
Emergency                 10.628     11.452
                          (4.243)    (4.318)
Num.Obs.      200        200        200
R2              0.001      0.031      0.036
R2 Adj.        -0.004      0.026      0.026
AIC          1888.1     1882.1     1883.1
BIC          1898.0     1892.0     1896.3
Log.Lik.     -941.048   -938.068   -937.540
F               0.277      6.275      3.659
RMSE           26.74      26.35      26.28

6.2 \(R^2_{change}\)

  • Change in \(R^2\) between models
    • Test whether that change is significant
    • Are the additional predictors worth a more complex model?
  • Only for nested models
    • The smaller model’s predictors are a subset of the larger model’s
    • Can’t compare, e.g., an untransformed model to a transformed one

6.3 Test of \(R^2_{change}\)

anova(pulse_vs_age, pulse_vs_age_ER)
Analysis of Variance Table

Model 1: Pulse ~ Age
Model 2: Pulse ~ Age + Emergency
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1    198 143046                                
2    197 138115  1    4930.5 7.0326 0.008656 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • This test is significant, so
    • Yes, adding Emergency to a model with Age significantly increases the \(R^2\) for the model from .001 to .036; F(1, 197) = 7.0326, p < .01
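The F statistic in this anova table can be recomputed by hand from the residual sums of squares it reports:

```r
# F for R^2 change: drop in RSS per added predictor, relative to the
# full model's residual variance (values from the anova table above)
rss_reduced <- 143046   # Pulse ~ Age
rss_full    <- 138115   # Pulse ~ Age + Emergency
df_change   <- 1        # one predictor added
df_full     <- 197      # N - predictors - 1 = 200 - 2 - 1

F_change <- ((rss_reduced - rss_full) / df_change) / (rss_full / df_full)
F_change  # approximately 7.03, matching the anova output
```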

6.4 Test of \(R^2_{change}\)

anova(pulse_vs_ER, pulse_vs_age_ER)
Analysis of Variance Table

Model 1: Pulse ~ Emergency
Model 2: Pulse ~ Age + Emergency
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    198 138845                           
2    197 138115  1    730.16 1.0415 0.3087
  • This test is not significant, so
    • No, adding Age to a model with Emergency does not significantly increase the \(R^2\) for the model from .031 to .036; F(1, 197) = 1.0415, p > .05

6.5 AIC and BIC

  • Use AIC or BIC to compare non-nested models
    • No statistical test: Smaller is better
    • Penalizes more complex models (i.e., more predictors)
  • AIC() and BIC() functions from the stats package (loaded by default)
AIC(pulse_vs_age)
[1] 1888.096
AIC(pulse_vs_ER)
[1] 1882.135
AIC(pulse_vs_age_ER)
[1] 1883.081

6.6 AIC for transformed models

AIC(bad_gdp)
[1] 12850.41
AIC(good_gdp)
[1] 11760.42
BIC(bad_gdp)
[1] 12866.73
BIC(good_gdp)
[1] 11776.74

7 Next time

7.1 Expanding the model (more)

  • Hypothesis testing
  • More checking assumptions: Residuals and outliers
  • Estimation of the model
    • \(R^2\), likelihood
  • Interaction (moderator) effects