Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Describe the logic of linear regression
  • Describe the assumptions of linear regression
  • Briefly present results of linear regression
  • Use a regression equation to summarize a model
  • Compare and contrast observed, predicted, and residual values

2 Motivation

2.1 Linear regression

  • General linear model or GLM
    • Includes linear regression, ANOVA, \(t\)-tests
  • One dependent variable or outcome
    • Predicted by 1 or more continuous or categorical independent variables or predictors
  • Causality not assumed, but sometimes (wrongly) implied

2.2 Purposes of linear regression

  • Inference
    • Confirm effect of predictor(s) on outcome
  • Prediction
    • Predict outcome value based on predictor value(s)
  • Variance explained
    • How much of the variance in the outcome is explained by the predictor(s)?

2.3 Assumptions

  • One continuous outcome variable
  • One or more predictor variable (nominal, ordinal, interval, ratio)
  • Relationship is linear in the predictors
  • Observations are independent (uncorrelated)
  • Residuals are conditionally normally distributed with constant variance (homoscedasticity)

3 Example

3.1 Data

  • ICU data from the Stat2Data package
    • ID: Patient ID code
    • Survive: 1 = patient survived to discharge or 0 = patient died
    • Age: Age (in years)
    • AgeGroup: 1 = young (under 50), 2 = middle (50-69), 3 = old (70+)
    • Sex: 1 = female or 0 = male
    • Infection: 1 = infection suspected or 0 = no infection
    • SysBP: Systolic blood pressure (in mm of Hg)
    • Pulse: Heart rate (beats per minute)
    • Emergency: 1 = emergency admission or 0 = elective admission

3.2 Data

library(Stat2Data)
data(ICU)
head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
1  4       0  87        3   1         1    80    96         1
2  8       1  27        1   1         1   142    88         1
3 12       1  59        2   0         0   112    80         1
4 14       1  77        3   0         0   100    70         0
5 27       0  76        3   1         1   128    90         1
6 28       1  54        2   0         1   142   103         1

3.3 Relations between variables

  • Y versus X
  • Y as a function of X
  • Y ~ X
  • X predicts Y

3.4 Example 1: Pulse vs Age

  • Does Pulse depend on Age?
ggplot(data = ICU, 
       aes(x = Age, y = Pulse)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth(method = "lm",
              se = TRUE)

3.5 Example 2: Pulse vs Emergency

  • Do emergency and non-emergency admits have similar Pulse rates?
ggplot(data = ICU, 
       aes(x = Emergency, y = Pulse)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth(method = "lm", 
              se = TRUE)

4 Inference

4.1 lm() function

  • lm() function in stats package (loaded by default)
  • Inputs
    • data: data frame
    • formula: relationship to model (Y ~ X)
  • Outputs

4.2 Example 1: Pulse vs Age

pulse_vs_age <- lm(data = ICU, Pulse ~ Age)
summary(pulse_vs_age)

Call:
lm(formula = Pulse ~ Age, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.998 -18.098  -4.123  19.452  92.402 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 96.04818    5.78821  16.594   <2e-16 ***
Age          0.04999    0.09501   0.526    0.599    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.88 on 198 degrees of freedom
Multiple R-squared:  0.001396,  Adjusted R-squared:  -0.003647 
F-statistic: 0.2769 on 1 and 198 DF,  p-value: 0.5993

4.3 Example 1: Pulse vs Age

  • Intercept = Expected Pulse value when Age = 0
    • \(b_0\) = 96.048, \(t(1)\) = 16.594, \(p\) < .0001
    • Pulse for a person at Age = 0 is 96.048
  • Slope = Expected change in Pulse for a 1 unit difference in Age
    • \(b_1\) = 0.049, \(t(1)\) = 0.526, \(p\) > .05
    • For each 1 unit increase in Age, we expect Pulse to increase by 0.526 points (NS)

4.4 Example 2: Pulse vs Emergency

pulse_vs_ER <- lm(data = ICU, Pulse ~ Emergency)
summary(pulse_vs_ER)

Call:
lm(formula = Pulse ~ Emergency, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-62.741 -17.741  -2.927  18.259  90.259 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   91.113      3.637  25.049   <2e-16 ***
Emergency     10.628      4.243   2.505   0.0131 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.48 on 198 degrees of freedom
Multiple R-squared:  0.03072,   Adjusted R-squared:  0.02582 
F-statistic: 6.275 on 1 and 198 DF,  p-value: 0.01305

4.5 Example 2: Pulse vs Emergency

  • Intercept = Expected Pulse when Emergency = 0
    • \(b_0\) = 91.113, \(t(1)\) = 25.049, \(p\) < .0001
    • Pulse for a person w non-emergency admit is 91.113
  • Slope = Expected change in Pulse for a 1 unit difference in Emergency
    • \(b_1\) = 10.628, \(t(1)\) = 2.505, \(p\) < .05
    • For each 1 unit increase in Emergency, we expect Pulse to increase by 10.628 points
    • Group difference (0 to 1) is 10.628 points

4.6 Regression equations

  • Two forms
    • \(\hat{Y} = b_0 + b_1 X_1 + b_2 X_2 + \cdots\)
      • Predicted Y is a function of coefficients and predictors
    • \(Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + error\)
      • Observed Y is a function of coefficients and predictors, plus error

4.7 Regression equations

  • Example 1:
    • \(\hat{Pulse} = 96.048 + 0.049*Age\)
  • Example 2:
    • \(\hat{Pulse} = 91.113 + 10.628*Emergency\)

4.8 How to describe slopes

  • For each 1 unit increase (or 1 unit change or 1 unit difference) in X, we expect Y to increase by \(b_1\) points on average
    • Are we increasing or changing X?
    • In these models (and often), no
  • For two individuals who are 1 unit apart on X, we expect their scores on Y to differ by \(b_1\) points on average
    • This is a mouthful but more technically accurate

4.9 Reporting \(p\)-values

  • Non-significant effect
    • Thresholds: \(b_1\) = 0.049, \(t(1)\) = 0.526, \(p\) > .05
    • Exact \(p\)-value: \(b_1\) = 0.049, \(t(1)\) = 0.526, \(p\) = .599
    • Not significant: \(b_1\) = 0.049, \(t(1)\) = 0.526, NS
  • Significant effect
    • Thresholds: \(b_1\) = 10.628, \(t(1)\) = 2.505, \(p\) < .05
    • Exact \(p\)-value: \(b_1\) = 10.628, \(t(1)\) = 2.505, \(p\) = .0131
  • \(p\)-value is never 0: Report \(p\) < .001 or similar

5 Prediction

5.1 Observed and predicted values

  • Observed values of the outcome in the dataset
    • One value per observation
    • Observed outcome: \(Y\)
  • Predicted values of the outcome that are predicted by the model
    • One value per observation
    • Predicted outcome: \(\hat{Y}\)
    • Use the regression equation to calculate

5.2 Predicted values

  • Example 1: Pulse ~ Age
    • Age = 87: \(\hat{Pulse} = 96.048 + 0.049 * 87 = 100.311\)
    • Age = 27: \(\hat{Pulse} = 96.048 + 0.049 * 27 = 97.371\)
    • Age = 59: \(\hat{Pulse} = 96.048 + 0.049 * 59 = 98.939\)
    • Age = 0: \(\hat{Pulse} = 96.048 + 0.049 * 0 = 96.048\)
    • Etc. for each value of Age

5.3 Predicted values

  • Example 2: Pulse ~ Emergency
    • Emergency = 0: \(\hat{Pulse} = 91.113 + 10.628 * 0 = 91.113\)
    • Emergency = 1: \(\hat{Pulse} = 91.113 + 10.628 * 1 = 101.741\)

5.4 Predicted values

  • predict() function in R
    • Talk more about this in coming weeks
    • Can use to create predicted values for
      • Original data
      • All possible values
      • Specific values

5.5 Example 1: Observed and predicted

ggplot(data = ICU, 
       aes(x = Age, y = Pulse)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth(method = "lm",
              se = FALSE, 
              size = 2)

5.6 Example 2: Observed and predicted

ggplot(data = ICU, 
       aes(x = Emergency, y = Pulse)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth(method = "lm", 
              se = FALSE, 
              size = 2)

5.7 Residual values

  • Residuals are the difference between the observed and predicted values for an observation
    • One residual per observation
    • Residual: \(Y - \hat{Y}\)
  • Assumptions for residuals
    • Constant variance (homoscedasticity)
    • Conditionally normally distributed

5.8 Predicted and residual values

ICU$pred1 <- fitted(pulse_vs_age)
ICU$resi1 <- residuals(pulse_vs_age)
ICU$pred2 <- fitted(pulse_vs_ER)
ICU$resi2 <- residuals(pulse_vs_ER)
head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency     pred1
1  4       0  87        3   1         1    80    96         1 100.39753
2  8       1  27        1   1         1   142    88         1  97.39798
3 12       1  59        2   0         0   112    80         1  98.99774
4 14       1  77        3   0         0   100    70         0  99.89760
5 27       0  76        3   1         1   128    90         1  99.84761
6 28       1  54        2   0         1   142   103         1  98.74778
       resi1     pred2      resi2
1  -4.397527 101.74150  -5.741497
2  -9.397981 101.74150 -13.741497
3 -18.997739 101.74150 -21.741497
4 -29.897603  91.11321 -21.113208
5  -9.847611 101.74150 -11.741497
6   4.252223 101.74150   1.258503

5.9 What is the residual?

ggplot(data = ICU, 
       aes(x = Age, y = Pulse)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth(method = "lm",
              se = FALSE, 
              size = 2) +
  geom_segment(aes(x = Age,
                   xend = Age,
                   y = Pulse,
                   yend = pred1))

5.10 Constant variance?

  • Plot residual vs predictor
ggplot(data = ICU, 
       aes(x = Age, y = resi1)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

5.11 Constant variance?

  • Plot residual vs predicted (for more than 1 predictor)
ggplot(data = ICU, 
       aes(x = pred1, y = resi1)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

5.12 Normally distributed?

  • Q-Q plot (overall distribution)
ggplot(data = ICU,
       aes(sample = resi1)) + 
       stat_qq() +
       stat_qq_line()

5.13 Normally distributed?

  • Histogram
ggplot(data = ICU,
       aes(x = resi1)) +
  geom_histogram()

5.14 (Conditionally) Normally distributed?

  • Histogram at each level of predictor
ggplot(data = ICU,
       aes(x = resi2)) +
  geom_histogram() +
  facet_grid(cols = vars(Emergency))

5.15 Constant variance?

  • Plot residual vs predictor
ggplot(data = ICU, 
       aes(x = Emergency, y = resi2)) + 
  geom_point(alpha = 0.3, 
             size = 2) +
  geom_smooth()

6 Variance explained

6.1 R-squared multiple

  • One number summary of the predictive power of the model
    • Proportion: Ranges from 0 to 1
      • 0 = predictor(s) explain no variance in outcome
      • 1 = predictor(s) explain all variance in outcome
  • Squared correlation (between observed and predicted values)
    • Effect size for the model
    • Other definitions related to estimation

6.2 Correlations

  • Correlation = 0.2
    • \(R^2\) = 0.04

  • Correlation = 0.8
    • \(R^2\) = 0.64

6.3 R-squared multiple

  • We’ll talk more about \(R^2_{multiple}\)
    • Estimation
    • Effect size
    • Logistic regression and Poisson regression

7 Presenting results

7.1 Packages to show model results

  • summary() function from base R
  • summ() function from jtools package
  • modelsummary() function in modelsummary package
    • Nice HTML output, multiple models in one table
  • Non-exhaustive list

7.2 Example 1: summary()

summary(pulse_vs_age)

Call:
lm(formula = Pulse ~ Age, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-59.998 -18.098  -4.123  19.452  92.402 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 96.04818    5.78821  16.594   <2e-16 ***
Age          0.04999    0.09501   0.526    0.599    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.88 on 198 degrees of freedom
Multiple R-squared:  0.001396,  Adjusted R-squared:  -0.003647 
F-statistic: 0.2769 on 1 and 198 DF,  p-value: 0.5993

7.3 Example 1: summ()

library(jtools)
summ(pulse_vs_age)
Observations 200
Dependent variable Pulse
Type OLS linear regression
F(1,198) 0.28
0.00
Adj. R² -0.00
Est. S.E. t val. p
(Intercept) 96.05 5.79 16.59 0.00
Age 0.05 0.10 0.53 0.60
Standard errors: OLS

7.4 Example 1: modelsummary()

library(modelsummary)
modelsummary(pulse_vs_age)
(1)
(Intercept) 96.048
(5.788)
Age 0.050
(0.095)
Num.Obs. 200
R2 0.001
R2 Adj. -0.004
AIC 1888.1
BIC 1898.0
Log.Lik. -941.048
F 0.277
RMSE 26.74

7.5 Example 1: summary() + kableExtra()

library(kableExtra)
kable(summary(pulse_vs_age)$coefficients)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 96.0481848 5.7882105 16.593761 0.0000000
Age 0.0499924 0.0950087 0.526188 0.5993465

7.6 Example 2: summary()

summary(pulse_vs_ER)

Call:
lm(formula = Pulse ~ Emergency, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-62.741 -17.741  -2.927  18.259  90.259 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   91.113      3.637  25.049   <2e-16 ***
Emergency     10.628      4.243   2.505   0.0131 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.48 on 198 degrees of freedom
Multiple R-squared:  0.03072,   Adjusted R-squared:  0.02582 
F-statistic: 6.275 on 1 and 198 DF,  p-value: 0.01305

7.7 Example 2: summ()

library(jtools)
summ(pulse_vs_ER)
Observations 200
Dependent variable Pulse
Type OLS linear regression
F(1,198) 6.28
0.03
Adj. R² 0.03
Est. S.E. t val. p
(Intercept) 91.11 3.64 25.05 0.00
Emergency 10.63 4.24 2.51 0.01
Standard errors: OLS

7.8 Example 2: modelsummary()

library(modelsummary)
modelsummary(pulse_vs_ER)
(1)
(Intercept) 91.113
(3.637)
Emergency 10.628
(4.243)
Num.Obs. 200
R2 0.031
R2 Adj. 0.026
AIC 1882.1
BIC 1892.0
Log.Lik. -938.068
F 6.275
RMSE 26.35

7.9 Example 2: summary() + kableExtra()

library(kableExtra)
kable(summary(pulse_vs_ER)$coefficients)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 91.11321 3.637436 25.048745 0.0000000
Emergency 10.62829 4.242792 2.505023 0.0130508

7.10 Multiple models in modelsummary()

library(modelsummary)
modelsummary(list(pulse_vs_age, pulse_vs_ER, pulse_age_ER))
(1) (2) (3)
(Intercept) 96.048 91.113 84.913
(5.788) (3.637) (7.081)
Age 0.050 0.097
(0.095) (0.095)
Emergency 10.628 11.452
(4.243) (4.318)
Num.Obs. 200 200 200
R2 0.001 0.031 0.036
R2 Adj. -0.004 0.026 0.026
AIC 1888.1 1882.1 1883.1
BIC 1898.0 1892.0 1896.3
Log.Lik. -941.048 -938.068 -937.540
F 0.277 6.275 3.659
RMSE 26.74 26.35 26.28

7.11 Example 1: Full results

  • The intercept (\(b_0\)) was 96.048, \(t(1)\) = 16.594, \(p\) < .0001. An individual who was 0 years old is expected to have a pulse rate of 96.048.

  • The slope for age (\(b_1\)) was 0.049, \(t(1)\) = 0.526, \(p\) > .05. For each 1 year increase in age, the expected pulse rate increased by 0.048 points; this effect was not significant.

  • The \(R^2\) for this model was 0.0014, F(1, 198) = 0.2969, p > .05. Less than 1% of the variance in pulse rate was explained by age; this was not significant.

7.12 Example 2: Full results

  • The intercept (\(b_0\)) was 91.113, \(t(1)\) = 25.049, \(p\) < .0001. An individual who had a non-emergency admission is expected to have a pulse rate of 91.113.

  • The slope for type of admission (\(b_1\)) was 10.628, \(t(1)\) = 2.505, \(p\) < .05. The difference in expected pulse rate between non-emergency admissions (0) and emergency admissions (1) was 10.628 points.

  • The \(R^2\) for this model was 0.0307, F(1, 198) = 6.275, p < .05. About 3% of the variance in pulse rate was explained by admission type.

8 Next time

8.1 Expanding the model

  • More complicated models
    • Two or more predictors
    • Non-linear relationships, interaction (moderator) effects
  • Predictor coding and centering
  • Comparing models
  • More interpretation, more checking assumptions
  • Estimation of the model