Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Add predictors to a linear regression model
  • Interpret partial regression coefficients
  • Re-code predictors to meet assumptions
  • Compare different models using \(R^2_{change}\)

2 Adding predictors

2.1 Review: Linear regression

  • Linear regression
    • One continuous outcome variable
    • One or more predictor variables

2.2 Bivariate regression

  • One predictor, one outcome
    • Slope of line relating predictor and outcome
      • Expected change in \(Y\) for 1 unit change in \(X\)
    • If predictor and outcome are standardized
      • Regression coefficient = correlation between \(X\) and \(Y\)
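This equivalence is easy to verify; a minimal sketch with simulated data (variable names are illustrative):

```r
# Simulate a correlated predictor and outcome
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Standardize both variables (mean 0, SD 1), then regress
xs <- as.numeric(scale(x))
ys <- as.numeric(scale(y))
b_std <- unname(coef(lm(ys ~ xs))[2])

# The standardized slope equals the Pearson correlation
all.equal(b_std, cor(x, y))  # TRUE
```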

2.3 Multiple regression

  • With multiple predictors
    • Correlations between predictors affect regression coefficients
    • Not just how \(X\) relates to \(Y\)
      • How does \(X\) relate to other predictors?
  • Interpretation changes
    • Subtle but important

2.4 Correlations and partial correlations

  • “Correlation” (bivariate correlation)
    • Ignore the effect of other predictors
  • Partial correlation
    • “Control for” or “partial out” the effect of other predictors
    • Other variables can be related to \(X\), \(Y\), or both
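One way to see the difference: a partial correlation is the correlation between the residuals of \(X_1\) and \(Y\) after each has been regressed on \(X_2\). A sketch with simulated data (names illustrative), where \(X_2\) drives both of the other variables:

```r
# Simulate X2 driving both X1 and Y
set.seed(2)
x2 <- rnorm(200)
x1 <- 0.6 * x2 + rnorm(200)
y  <- 0.6 * x2 + rnorm(200)

# Bivariate correlation between X1 and Y (ignores X2)
r_biv <- cor(x1, y)

# Partial correlation: correlate the parts of X1 and Y not explained by X2
r_part <- cor(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)))

r_biv   # noticeably positive, driven by the shared X2
r_part  # near zero once X2 is partialled out
```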

2.5 Three variables

Three variables that are all related to each other

2.6 Bivariate correlation

Correlation between \(X_1\) and \(Y\), ignoring \(X_2\)

2.7 Partial correlation

Correlation between \(X_1\) and \(Y\), removing the effect of \(X_2\)

2.8 Uncorrelated predictors

Correlation between \(X_1\) and \(Y\), no effect of \(X_2\)

2.9 Example data

  • ICU data from the Stat2Data package
library(Stat2Data)
data(ICU)
head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
1  4       0  87        3   1         1    80    96         1
2  8       1  27        1   1         1   142    88         1
3 12       1  59        2   0         0   112    80         1
4 14       1  77        3   0         0   100    70         0
5 27       0  76        3   1         1   128    90         1
6 28       1  54        2   0         1   142   103         1

2.10 Correlations between variables

ICU_sub <- ICU[, c(3, 8, 9)]
cor(ICU_sub)
                  Age      Pulse  Emergency
Age        1.00000000 0.03736843 -0.1869571
Pulse      0.03736843 1.00000000  0.1752685
Emergency -0.18695714 0.17526851  1.0000000

2.11 Three models

  • Pulse ~ Age
pulse_vs_age <- lm(data = ICU, 
                   Pulse ~ Age)
  • Pulse ~ Emergency
pulse_vs_ER <- lm(data = ICU, 
                  Pulse ~ Emergency)
  • Pulse ~ Age + Emergency
pulse_vs_age_ER <- lm(data = ICU, 
                      Pulse ~ Age + Emergency)

2.12 Compare models

library(modelsummary)
modelsummary(list(pulse_vs_age, pulse_vs_ER, pulse_vs_age_ER))
                 (1)        (2)        (3)
(Intercept)    96.048     91.113     84.913
              (5.788)    (3.637)    (7.081)
Age             0.050                 0.097
               (0.095)               (0.095)
Emergency                 10.628     11.452
                          (4.243)    (4.318)
Num.Obs.      200        200        200
R2              0.001      0.031      0.036
R2 Adj.        -0.004      0.026      0.026
AIC          1888.1     1882.1     1883.1
BIC          1898.0     1892.0     1896.3
Log.Lik.     -941.048   -938.068   -937.540
F               0.277      6.275      3.659
RMSE           26.74      26.35      26.28

2.13 Just the two-predictor model

summary(pulse_vs_age_ER)

Call:
lm(formula = Pulse ~ Age + Emergency, data = ICU)

Residuals:
   Min     1Q Median     3Q    Max 
-63.10 -18.13  -3.56  17.27  88.73 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 84.91261    7.08132  11.991  < 2e-16 ***
Age          0.09723    0.09527   1.021  0.30873    
Emergency   11.45223    4.31849   2.652  0.00866 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.48 on 197 degrees of freedom
Multiple R-squared:  0.03582,   Adjusted R-squared:  0.02603 
F-statistic: 3.659 on 2 and 197 DF,  p-value: 0.02753

2.14 Interpretation

  • \(\hat{Pulse} = 84.913 + 0.097 * Age + 11.452 * Emergency\)

  • The intercept (\(b_0\)) was 84.913, \(t(197)\) = 11.991, \(p\) < .0001. An individual who was 0 years old and had a non-emergency admission is expected to have a pulse rate of 84.913.

    • \(\hat{Pulse} = 84.913 + 0.097 * 0 + 11.452 * 0 = 84.913\)

2.15 Interpretation

  • The slope for age (\(b_1\)) was 0.097, \(t(197)\) = 1.021, \(p\) > .05. For each 1 year increase in age, the expected pulse rate increased by 0.097 points, holding admission type constant; this effect was not significant.

  • The slope for admission type (\(b_2\)) was 11.452, \(t(197)\) = 2.652, \(p\) < .01. The difference in expected pulse rate between non-emergency admissions (0) and emergency admissions (1) was 11.452 points, holding age constant.

2.16 Interpretation

  • The \(R^2\) for this model was 0.036, F(2, 197) = 3.659, p < .05. About 3.6% of the variance in pulse rate was explained by age and admission type.
    • \(F\)-test for \(R^2_{multiple}\) has two different degrees of freedom
    • Numerator df: # of predictors = 2
    • Denominator df: \(N\) - # of predictors - 1 = 200 - 2 - 1 = 197
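The reported p-value can be reproduced from the F statistic and these two degrees of freedom:

```r
# Upper-tail probability of an F(2, 197) distribution at the observed statistic
p_val <- pf(3.659, df1 = 2, df2 = 197, lower.tail = FALSE)
p_val  # approximately 0.0275, matching the model summary
```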

2.17 Two regression lines

  • Simplify regression equation for each Emergency group
    • Emergency = 0
      • \(\hat{Pulse} = 84.913 + 0.097 * Age + 11.452 * 0\)
      • \(\hat{Pulse} = 84.913 + 0.097 * Age\)
    • Emergency = 1
      • \(\hat{Pulse} = 84.913 + 0.097 * Age + 11.452 * 1\)
      • \(\hat{Pulse} = 96.365 + 0.097 * Age\)

2.18 Two parallel lines

library(ggplot2)
ggplot(data = ICU,
       aes(x = Age, 
           y = Pulse, 
           color = as.factor(Emergency))) +
  geom_point(alpha = 0.4,
             size = 2) +
  geom_abline(intercept = 84.913, 
              slope = 0.097, 
              color = "#F8766D",
              linewidth = 1) +
  geom_abline(intercept = 96.365, 
              slope = 0.097, 
              color = "#00BA38",
              linewidth = 1)

  • Same slope for each line: 0.097
  • Same difference between lines at any Age: 11.452

2.19 Check assumptions

  • With more than 1 predictor
    • Residual vs predictor 1
    • Residual vs predictor 2
    • Residual vs predicted outcome value (combination of predictor 1 and 2)
ICU$pred <- fitted(pulse_vs_age_ER)
ICU$resi <- residuals(pulse_vs_age_ER)

2.20 Residual vs Age

ggplot(data = ICU, 
       aes(x = Age, y = resi)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

2.21 Residual vs Emergency

ggplot(data = ICU, 
       aes(x = Emergency, y = resi)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

2.22 Residual vs predicted Pulse

ggplot(data = ICU, 
       aes(x = pred, y = resi)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

2.23 Distribution of residuals

ggplot(data = ICU, 
       aes(x = resi)) + 
  geom_histogram()

3 Coding predictors

3.1 Binary predictors

  • Simplest coding scheme is 0,1
    • Intercept = expected outcome value for group coded zero
  • Other options
    • 1,2 or -1,+1
    • Intercept is less interpretable

3.2 3 or more categories

  • Dummy codes: Extension of 0,1 scheme to more categories
    • Also called “indicator variables”
  • Note: 2 categories \(\rightarrow\) 1 indicator variable
    • Always use \(a - 1\) variables for \(a\) categories

3.3 3 or more categories

  • AgeGroup in the ICU data frame has 3 categories
    • 1 = young (under 50)
    • 2 = middle (50-69)
    • 3 = old (70+)
  • Create 3 indicator variables (dummy codes)
    • Value of 1 for a category, 0s for all other categories

3.4 3 or more categories

library(dplyr)
ICU <- ICU %>% mutate(d1 = ifelse(AgeGroup == 1, 1, 0),
                      d2 = ifelse(AgeGroup == 2, 1, 0),
                      d3 = ifelse(AgeGroup == 3, 1, 0))

head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency      pred
1  4       0  87        3   1         1    80    96         1 104.82370
2  8       1  27        1   1         1   142    88         1  98.99000
3 12       1  59        2   0         0   112    80         1 102.10131
4 14       1  77        3   0         0   100    70         0  92.39919
5 27       0  76        3   1         1   128    90         1 103.75419
6 28       1  54        2   0         1   142   103         1 101.61517
        resi d1 d2 d3
1  -8.823701  0  0  1
2 -10.990001  1  0  0
3 -22.101308  0  1  0
4 -22.399190  0  0  1
5 -13.754189  0  0  1
6   1.384834  0  1  0

3.5 Analysis

  • Select any two (in general, \(a - 1\)) dummy codes as predictors
    • The one that’s left out is the reference group (all 0s)
    • All other groups are compared to the reference group
  AgeGroup  d1  d2  d3
         1   1   0   0
         2   0   1   0
         3   0   0   1
  • Leave out d3: People in AgeGroup = 3 have 0s on both d1 and d2

3.6 Analysis

pulse_vs_agegroup <- lm(data = ICU, Pulse ~ d1 + d2)
summary(pulse_vs_agegroup)

Call:
lm(formula = Pulse ~ d1 + d2, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.429 -18.377  -3.429  19.781  92.641 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  99.3594     3.3694  29.489   <2e-16 ***
d1           -1.5628     4.8650  -0.321    0.748    
d2            0.0692     4.5595   0.015    0.988    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.96 on 197 degrees of freedom
Multiple R-squared:  0.000745,  Adjusted R-squared:  -0.0094 
F-statistic: 0.07344 on 2 and 197 DF,  p-value: 0.9292

3.7 Interpretation

  • Intercept: Expected Pulse for reference group (AgeGroup = 3)
  • Effect of d1: Expected difference between the reference group (AgeGroup = 3) and the group coded 1 on d1 (AgeGroup = 1)
  • Effect of d2: Expected difference between the reference group (AgeGroup = 3) and the group coded 1 on d2 (AgeGroup = 2)

3.8 Interpretation

  • Pulse for AgeGroup = 3 is the intercept
    • 99.359
  • Pulse for AgeGroup = 1 is 1.563 points lower (negative) than the value for AgeGroup = 3
    • 99.359 - 1.563 = 97.796
  • Pulse for AgeGroup = 2 is 0.069 points higher (positive) than the value for AgeGroup = 3
    • 99.359 + 0.069 = 99.428

3.9 Interpretation: Figure

ggplot(data = ICU,
       aes(x = AgeGroup, y = Pulse)) +
  geom_point() +
  geom_hline(yintercept = 99.359,
             color = "black") + #3
  geom_hline(yintercept = 97.796,
             color = "blue") + #1
  geom_hline(yintercept = 99.428,
             color = "red",
             linetype = "dashed") + #2
  coord_flip()

4 Centering predictors

4.1 What is an intercept?

  • The expected value of the outcome when all predictors are equal to 0
    • (We just saw how this worked for dummy codes)
    • 0 may not be a possible value for some predictors
    • 0 may not be a meaningful value (e.g., an age of 0)
  • How can we make the 0 value more meaningful?

4.2 Centering predictors

  • Move the 0 point to a new value
  • Typically mean centering
    • But you can choose other (meaningful) values
    • Median, percentile, cut-off, etc.
  • Subtract this value (e.g., mean) from all predictor values
  • Mean of original variable = 0 in new variable
    • 0 is now a meaningful value, so the intercept is interpretable

4.3 Example: Mean center Age

mean(ICU$Age, na.rm = TRUE)
[1] 57.545
ICU$AgeC = ICU$Age - mean(ICU$Age, na.rm = TRUE)
head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency      pred
1  4       0  87        3   1         1    80    96         1 104.82370
2  8       1  27        1   1         1   142    88         1  98.99000
3 12       1  59        2   0         0   112    80         1 102.10131
4 14       1  77        3   0         0   100    70         0  92.39919
5 27       0  76        3   1         1   128    90         1 103.75419
6 28       1  54        2   0         1   142   103         1 101.61517
        resi d1 d2 d3    AgeC
1  -8.823701  0  0  1  29.455
2 -10.990001  1  0  0 -30.545
3 -22.101308  0  1  0   1.455
4 -22.399190  0  0  1  19.455
5 -13.754189  0  0  1  18.455
6   1.384834  0  1  0  -3.545

4.4 Model with centered predictor

pulse_vs_ageC <- lm(data = ICU, Pulse ~ AgeC)
summary(pulse_vs_ageC)

Call:
lm(formula = Pulse ~ AgeC, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.998 -18.098  -4.123  19.452  92.402 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 98.92500    1.90060  52.049   <2e-16 ***
AgeC         0.04999    0.09501   0.526    0.599    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.88 on 198 degrees of freedom
Multiple R-squared:  0.001396,  Adjusted R-squared:  -0.003647 
F-statistic: 0.2769 on 1 and 198 DF,  p-value: 0.5993

4.5 Model with centered predictor

ggplot(data = ICU, 
       aes(x = AgeC, y = Pulse)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE) +
  annotate("pointrange", x = 0, 
           y = 98.925, 
           ymin = 98.925, 
           ymax = 98.925,
           colour = "red", 
           size = 2)

  • \(b_0\) = 98.925: Predicted Pulse at mean Age

4.6 Other implications

  • Centering predictors doesn’t change the slopes, only intercepts
  • Centering predictors doesn’t change the \(R^2_{multiple}\)
  • Interactions / moderation:
    • Conditional effects: Effect of one predictor depends on other predictors
    • Specifically, conditional on predictors being equal to 0
      • We’ll care where 0 is
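These invariances are easy to check; a sketch comparing raw and mean-centered versions of one predictor (simulated data, names illustrative):

```r
# Simulate a predictor whose mean is far from 0
set.seed(4)
x <- rnorm(100, mean = 50, sd = 10)
y <- 2 + 0.3 * x + rnorm(100)

fit_raw      <- lm(y ~ x)
fit_centered <- lm(y ~ I(x - mean(x)))

# Slope and R^2 are unchanged; only the intercept moves
all.equal(unname(coef(fit_raw)[2]), unname(coef(fit_centered)[2]))       # TRUE
all.equal(summary(fit_raw)$r.squared, summary(fit_centered)$r.squared)   # TRUE
coef(fit_centered)[1]  # intercept: predicted y at the mean of x
```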

5 Transforming predictors

5.1 “Linear in the predictors”

  • \(\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots\)
    • Predicted value is a linear combination of regression coefficients and predictors
  • In practice, this looks like a linear relationship between \(X\) and \(Y\)
    • What if the relationship isn’t a straight line?

5.2 Gapminder data

library(gapminder)
data(gapminder)
ggplot(data = gapminder,
       aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth()

5.3 Bad model

bad_gdp <- lm(data = gapminder, lifeExp ~ gdpPercap)
gapminder$bad_pred <- fitted(bad_gdp)
gapminder$bad_resi <- residuals(bad_gdp)
summary(bad_gdp)

Call:
lm(formula = lifeExp ~ gdpPercap, data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-82.754  -7.758   2.176   8.225  18.426 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.396e+01  3.150e-01  171.29   <2e-16 ***
gdpPercap   7.649e-04  2.579e-05   29.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.49 on 1702 degrees of freedom
Multiple R-squared:  0.3407,    Adjusted R-squared:  0.3403 
F-statistic: 879.6 on 1 and 1702 DF,  p-value: < 2.2e-16

5.4 Bad model: Fitted line

ggplot(data = gapminder, 
       aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm")

5.5 Bad model: Residual vs predictor

ggplot(data = gapminder, 
       aes(x = gdpPercap, y = bad_resi)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

5.6 Bad model: Residual distribution

ggplot(data = gapminder, 
       aes(x = bad_resi)) + 
  geom_histogram()

5.7 Transforming predictors

  • It’s generally OK to transform predictors to restore a linear relationship
    • Log transformation for very skewed predictors (e.g., GDP, income)
    • Add squared term (\(X^2\)) for curvilinear effects
  • How do you know what transformation to use?
    • Literature
    • Plots
  • Note: Do not transform outcomes
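The squared-term option can be sketched with simulated curvilinear data (names illustrative); in an R formula, `I()` protects the arithmetic in `x^2`:

```r
# Simulate a quadratic (inverted-U) relationship
set.seed(5)
x <- runif(200, 0, 10)
y <- 1 + 2 * x - 0.2 * x^2 + rnorm(200)

# Linear-only model vs model with an added squared term
fit_lin  <- lm(y ~ x)
fit_quad <- lm(y ~ x + I(x^2))

summary(fit_lin)$r.squared   # misses the curvature
summary(fit_quad)$r.squared  # substantially higher
```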

5.8 Example: Transform predictors

gapminder <- gapminder %>% mutate(loggdpPercap = log(gdpPercap))
ggplot(data = gapminder,
       aes(x = loggdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth()

5.9 Better model

good_gdp <- lm(data = gapminder, lifeExp ~ loggdpPercap)
gapminder$good_pred <- fitted(good_gdp)
gapminder$good_resi <- residuals(good_gdp)
summary(good_gdp)

Call:
lm(formula = lifeExp ~ loggdpPercap, data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.778  -4.204   1.212   4.658  19.285 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -9.1009     1.2277  -7.413 1.93e-13 ***
loggdpPercap   8.4051     0.1488  56.500  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.62 on 1702 degrees of freedom
Multiple R-squared:  0.6522,    Adjusted R-squared:  0.652 
F-statistic:  3192 on 1 and 1702 DF,  p-value: < 2.2e-16

5.10 Better model: Fitted line

ggplot(data = gapminder, 
       aes(x = loggdpPercap, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm")

5.11 Better model: Residual vs predictor

ggplot(data = gapminder, 
       aes(x = loggdpPercap, y = good_resi)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

5.12 Better model: Residual distribution

ggplot(data = gapminder, 
       aes(x = good_resi)) + 
  geom_histogram()

6 Comparing models

6.1 Comparing models

modelsummary(list(pulse_vs_age, pulse_vs_ER, pulse_vs_age_ER))
                 (1)        (2)        (3)
(Intercept)    96.048     91.113     84.913
              (5.788)    (3.637)    (7.081)
Age             0.050                 0.097
               (0.095)               (0.095)
Emergency                 10.628     11.452
                          (4.243)    (4.318)
Num.Obs.      200        200        200
R2              0.001      0.031      0.036
R2 Adj.        -0.004      0.026      0.026
AIC          1888.1     1882.1     1883.1
BIC          1898.0     1892.0     1896.3
Log.Lik.     -941.048   -938.068   -937.540
F               0.277      6.275      3.659
RMSE           26.74      26.35      26.28

6.2 \(R^2_{change}\)

  • Change in \(R^2\) between models
    • Test whether that change is significant
    • Are the additional predictors worth a more complex model?
  • Only for nested models
    • The smaller model’s predictors are a subset of the larger model’s
    • Can’t compare, e.g., an untransformed model to a transformed one

6.3 Test of \(R^2_{change}\)

anova(pulse_vs_age, pulse_vs_age_ER)
Analysis of Variance Table

Model 1: Pulse ~ Age
Model 2: Pulse ~ Age + Emergency
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1    198 143046                                
2    197 138115  1    4930.5 7.0326 0.008656 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • This test is significant, so
    • Yes, adding Emergency to a model with Age significantly increases the \(R^2\) for the model from .001 to .036; F(1, 197) = 7.0326, p < .01
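The F statistic in this anova table can be recomputed by hand from the residual sums of squares it reports:

```r
# F for R^2 change: drop in RSS per added predictor, relative to the
# full model's residual variance (values from the anova table above)
rss_reduced <- 143046   # Pulse ~ Age
rss_full    <- 138115   # Pulse ~ Age + Emergency
df_change   <- 1        # one predictor added
df_full     <- 197      # N - predictors - 1 = 200 - 2 - 1

F_change <- ((rss_reduced - rss_full) / df_change) / (rss_full / df_full)
F_change  # approximately 7.03, matching the anova output
```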

6.4 Test of \(R^2_{change}\)

anova(pulse_vs_ER, pulse_vs_age_ER)
Analysis of Variance Table

Model 1: Pulse ~ Emergency
Model 2: Pulse ~ Age + Emergency
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    198 138845                           
2    197 138115  1    730.16 1.0415 0.3087
  • This test is not significant, so
    • No, adding Age to a model with Emergency does not significantly increase the \(R^2\) for the model from .031 to .036; F(1, 197) = 1.0415, p > .05

6.5 AIC and BIC

  • Use AIC or BIC to compare non-nested models
    • No statistical test: Smaller is better
    • Penalizes more complex models (i.e., more predictors)
  • AIC() and BIC() functions from the stats package (loaded by default)
AIC(pulse_vs_age)
[1] 1888.096
AIC(pulse_vs_ER)
[1] 1882.135
AIC(pulse_vs_age_ER)
[1] 1883.081

6.6 AIC for transformed models

AIC(bad_gdp)
[1] 12850.41
AIC(good_gdp)
[1] 11760.42
BIC(bad_gdp)
[1] 12866.73
BIC(good_gdp)
[1] 11776.74

7 Next time

7.1 Expanding the model (more)

  • Hypothesis testing
  • More checking assumptions: Residuals and outliers
  • Estimation of the model
    • \(R^2\), likelihood
  • Interaction (moderator) effects