Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Assess models for (multi)collinearity among predictors
  • Conduct outlier analyses to determine extreme and/or problematic cases

2 Collinearity

2.1 Collinearity

  • Sometimes called multicollinearity
  • High correlations between predictors in linear regression
  • Why does that matter?

2.2 Recall: Partial correlations

Correlation between \(X_1\) and \(Y\), ignoring the effect of \(X_2\)

2.3 Recall: Partial correlations

Correlation between \(X_1\) and \(Y\), removing the effect of \(X_2\)

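This residual-on-residual idea is easy to sketch in R with simulated data (a minimal sketch; the variables x1, x2, and y are hypothetical):

set.seed(1)
x2 <- rnorm(100)
x1 <- 0.5 * x2 + rnorm(100)             # x1 is partly driven by x2
y  <- 0.4 * x1 + 0.3 * x2 + rnorm(100)  # y is driven by both
r_x1 <- resid(lm(x1 ~ x2))              # x1 with the effect of x2 removed
r_y  <- resid(lm(y ~ x2))               # y with the effect of x2 removed
cor(r_x1, r_y)                          # partial correlation of x1 and y, given x2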

2.5 Why “collinear”?

  • Variables are vectors
  • Represent any vector as a line from the origin (0,0) to a point
  • The cosine of the angle between two (centered) vectors is their correlation
    • 180 degrees = -1
    • 90 degrees = 0
    • 0 degrees (on top of one another) = +1
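
A quick numerical check of this idea (a sketch with simulated vectors):

set.seed(2)
a <- rnorm(50)
b <- 0.6 * a + rnorm(50)
a_c <- a - mean(a)   # center both vectors
b_c <- b - mean(b)
# Cosine of the angle between the centered vectors...
sum(a_c * b_c) / sqrt(sum(a_c^2) * sum(b_c^2))
# ...equals the correlation
cor(a, b)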

2.6 Data

  • ICU data from the Stat2Data package
library(Stat2Data)
data(ICU)
colnames(ICU)
[1] "ID"        "Survive"   "Age"       "AgeGroup"  "Sex"       "Infection"
[7] "SysBP"     "Pulse"     "Emergency"

2.7 Correlations between variables

cor(ICU)
                    ID     Survive         Age    AgeGroup          Sex
ID         1.000000000  0.09008759 -0.13379591 -0.09294666  0.004507259
Survive    0.090087587  1.00000000 -0.18945786 -0.19137028 -0.020602141
Age       -0.133795910 -0.18945786  1.00000000  0.89932798  0.096077315
AgeGroup  -0.092946657 -0.19137028  0.89932798  1.00000000  0.106451213
Sex        0.004507259 -0.02060214  0.09607732  0.10645121  1.000000000
Infection -0.129037797 -0.18234920  0.15355452  0.10210375  0.022540704
SysBP      0.007579045  0.20467231  0.04259431  0.04455849  0.049428168
Pulse      0.027350603 -0.03176090  0.03736843  0.02249490  0.039529993
Emergency  0.199467075 -0.24358013 -0.18695714 -0.19765780  0.119971717
           Infection        SysBP       Pulse  Emergency
ID        -0.1290378  0.007579045  0.02735060  0.1994671
Survive   -0.1823492  0.204672308 -0.03176090 -0.2435801
Age        0.1535545  0.042594310  0.03736843 -0.1869571
AgeGroup   0.1021037  0.044558494  0.02249490 -0.1976578
Sex        0.0225407  0.049428168  0.03952999  0.1199717
Infection  1.0000000 -0.228538626  0.31051173  0.1666485
SysBP     -0.2285386  1.000000000 -0.05658246 -0.1841112
Pulse      0.3105117 -0.056582465  1.00000000  0.1752685
Emergency  0.1666485 -0.184111172  0.17526851  1.0000000

2.8 Predict SysBP from “everything”

m1 <- lm(data = ICU, SysBP ~ . - ID - AgeGroup)
summary(m1)

Call:
lm(formula = SysBP ~ . - ID - AgeGroup, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-78.827 -19.835  -2.923  19.312 129.997 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 121.72860   13.42593   9.067  < 2e-16 ***
Survive      12.94692    6.00852   2.155  0.03242 *  
Age           0.12509    0.12023   1.040  0.29944    
Sex           4.20390    4.68840   0.897  0.37102    
Infection   -13.48289    4.92376  -2.738  0.00675 ** 
Pulse         0.03095    0.08903   0.348  0.72848    
Emergency    -8.16479    5.58844  -1.461  0.14564    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.68 on 193 degrees of freedom
Multiple R-squared:  0.1036,    Adjusted R-squared:  0.07572 
F-statistic: 3.717 on 6 and 193 DF,  p-value: 0.001606

2.9 Emergency predicts SysBP alone

m2 <- lm(data = ICU, SysBP ~ Emergency)
summary(m2)

Call:
lm(formula = SysBP ~ Emergency, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-92.646 -18.646  -0.502  17.426 127.354 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  142.358      4.460  31.918  < 2e-16 ***
Emergency    -13.712      5.202  -2.636  0.00906 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.47 on 198 degrees of freedom
Multiple R-squared:  0.0339,    Adjusted R-squared:  0.02902 
F-statistic: 6.947 on 1 and 198 DF,  p-value: 0.009061

2.10 Variance inflation factor (VIF)

  • VIF = \(\frac{1}{1 - R^2_j}\)
    • \(R^2_j\) is the \(R^2\) from regressing predictor \(j\) on all of the other predictors
  • Sampling variance of a regression coefficient = \(\frac{s^2}{(n - 1) \hat{var}(X_j)} \times \frac{1}{1 - R^2_j}\)
    • \(s^2\) is the variance of the residuals
    • \(n\) is the sample size
    • \(\hat{var}(X_j)\) is the estimated variance of the predictor

2.11 Variance inflation factor (VIF)

  • Cut-offs for high VIF: 5 (common) or 10
  • VIF is not the ratio of SEs for the big model vs. the small model
    • The square root of VIF is how much larger the standard error is compared to a hypothetical model in which the predictor is uncorrelated with all the other predictors
  • You might also (or instead) hear about tolerance
    • Tolerance = \(\frac{1}{VIF} = 1 - R^2_j\)

2.12 Variance inflation factor (VIF)

library(car)
vif(m1)
  Survive       Age       Sex Infection     Pulse Emergency 
 1.151100  1.152752  1.032009  1.176871  1.131408  1.212195 
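
These values can be reproduced from the definition. A sketch for Emergency, using the other predictors in m1:

# R^2 from regressing Emergency on the remaining predictors
r2_emerg <- summary(lm(data = ICU,
                       Emergency ~ Survive + Age + Sex + Infection + Pulse))$r.squared
1 / (1 - r2_emerg)   # should match vif(m1)["Emergency"], about 1.21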

2.13 Big model vs small model

  • Two things change
    • Big model has partial regression coefficients
      • “Holding all other variables constant”
      • Changes the regression coefficient: -8.16 vs -13.71
    • Variance inflated in big model due to correlated predictors
      • Changes the standard error: 5.59 vs 5.20
      • About 7.5% increase
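
Both ratios can be checked against the output above:

5.58844 / 5.202              # SE ratio, big vs. small model: about 1.074
sqrt(vif(m1)["Emergency"])   # sqrt(VIF) benchmark: about 1.10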

2.14 Highly correlated predictors

  • Age and AgeGroup are correlated 0.899
m3 <- lm(data = ICU, SysBP ~ . - ID)
library(car)
vif(m3)
  Survive       Age  AgeGroup       Sex Infection     Pulse Emergency 
 1.159103  5.356614  5.334519  1.035521  1.184188  1.131564  1.220301 

2.15 Summary: Collinearity

  • Look at correlations between predictors
    • Not the whole story
    • VIF uses the \(R^2\) from regressing each predictor on all of the others
  • Look at VIF values for the model
    • Cut-off of 5
  • Very high VIF
    • Do you need all of those predictors?
    • Can you combine some predictors?

3 Outliers

3.1 Outliers

  • Observations that
    • Are extreme on the predictor(s)
    • Fall far from their predicted values
    • Change the coefficients or predicted values

3.2 Outlier statistics

  • Model-based
    • Based on a linear regression (or other) model
  • Case-wise
    • One value per case (observation)
  • Three types of outlier statistics for linear regression
    • Leverage
    • Discrepancy
    • Influence

3.3 Leverage

  • Extreme in the predictor space
    • “Hat matrix” from the model
  • How far is a case from the centroid of the predictor space?
  • Cut-offs with \(k\) predictors
    • Large samples, many predictors: \(\frac{2(k + 1)}{n}\)
    • Smaller samples: \(\frac{3(k + 1)}{n}\)
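
For m1 above (k = 6 predictors, n = 200 cases), these cut-offs work out to:

k <- 6
n <- 200
2 * (k + 1) / n   # 0.07:  large-sample cut-off
3 * (k + 1) / n   # 0.105: smaller-sample cut-off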

3.4 Centroid of predictor space

  • If just Age and Pulse are predictors
library(ggplot2)
ggplot(data = ICU, 
       aes(x = Pulse, y = Age)) +
  geom_point(alpha = 0.3, 
             color = "blue",
             size = 4) +
  annotate("pointrange",
           x = mean(ICU$Pulse, na.rm = TRUE),
           y = mean(ICU$Age, na.rm = TRUE),
           ymin = mean(ICU$Age, na.rm = TRUE),
           ymax = mean(ICU$Age, na.rm = TRUE),
           color = "red",
           size = 2) +
  stat_ellipse(linetype = "dashed")

3.5 Leverage

  • hatvalues() in base R
  • .hat column from augment() in the broom package

3.6 Leverage: hatvalues()

hatvalues(m1)
         1          2          3          4          5          6          7 
0.04136616 0.04193692 0.02038323 0.02868237 0.03909928 0.02054581 0.04833027 
         8          9         10         11         12         13         14 
0.02676640 0.02572493 0.02918766 0.02802245 0.03910343 0.03386752 0.04735363 
        15         16         17         18         19         20         21 
0.03205395 0.02595774 0.03209545 0.02134725 0.02185566 0.04258039 0.03541913 
        22         23         24         25         26         27         28 
0.06151462 0.02853602 0.02468023 0.02064414 0.02803684 0.04139703 0.03119456 
        29         30         31         32         33         34         35 
0.02144683 0.06060417 0.01887221 0.03292118 0.02384578 0.04691193 0.03465486 
        36         37         38         39         40         41         42 
0.04347021 0.04875004 0.01691440 0.02291959 0.02774395 0.04082458 0.03992122 
        43         44         45         46         47         48         49 
0.03859821 0.02506167 0.02505285 0.03525264 0.08745421 0.04394462 0.02320841 
        50         51         52         53         54         55         56 
0.06811084 0.02658598 0.03343375 0.03451670 0.04765859 0.03440951 0.04627143 
        57         58         59         60         61         62         63 
0.05858381 0.03742786 0.03961405 0.02717904 0.01852133 0.02332432 0.05111651 
        64         65         66         67         68         69         70 
0.04346598 0.02468261 0.03152511 0.02466933 0.04996516 0.04636771 0.02213098 
        71         72         73         74         75         76         77 
0.03373205 0.04023063 0.02478997 0.03386846 0.02579042 0.04942887 0.03371832 
        78         79         80         81         82         83         84 
0.02076133 0.04039878 0.08388916 0.02664010 0.02037631 0.04253115 0.03320655 
        85         86         87         88         89         90         91 
0.03446771 0.03413661 0.03105504 0.02060281 0.02100419 0.03725994 0.01970532 
        92         93         94         95         96         97         98 
0.03137055 0.06045898 0.02893790 0.04554696 0.03883473 0.04452235 0.02419923 
        99        100        101        102        103        104        105 
0.04103317 0.04157161 0.04263943 0.02834146 0.03718586 0.05326585 0.04575464 
       106        107        108        109        110        111        112 
0.04134128 0.04924349 0.02317801 0.03378472 0.02522227 0.04353502 0.04001761 
       113        114        115        116        117        118        119 
0.03365728 0.02431396 0.03067899 0.02419923 0.02889805 0.02072236 0.04452363 
       120        121        122        123        124        125        126 
0.02657169 0.03704621 0.03349834 0.05330341 0.03367753 0.03904766 0.04777942 
       127        128        129        130        131        132        133 
0.02935877 0.03577744 0.03885206 0.02773097 0.07196314 0.05090712 0.02046498 
       134        135        136        137        138        139        140 
0.03231415 0.02599930 0.03545566 0.04728562 0.03446484 0.03948813 0.02513356 
       141        142        143        144        145        146        147 
0.03918547 0.02311702 0.02587308 0.02725079 0.01829482 0.02781909 0.05032186 
       148        149        150        151        152        153        154 
0.04081059 0.03249827 0.03569586 0.04139041 0.02789797 0.04382672 0.05040804 
       155        156        157        158        159        160        161 
0.03172093 0.01894094 0.03222629 0.03516660 0.07547144 0.07788693 0.03140165 
       162        163        164        165        166        167        168 
0.01823211 0.03893800 0.03294626 0.02311711 0.03260965 0.04524500 0.03446130 
       169        170        171        172        173        174        175 
0.02021935 0.03541123 0.03357409 0.01683831 0.02664340 0.03274660 0.02132949 
       176        177        178        179        180        181        182 
0.02242424 0.02664338 0.02779856 0.05677280 0.03169522 0.05465986 0.03619239 
       183        184        185        186        187        188        189 
0.03233791 0.03169522 0.02125515 0.05362754 0.02740321 0.03917655 0.02877631 
       190        191        192        193        194        195        196 
0.04178583 0.02413103 0.02456579 0.04330927 0.02560642 0.02355296 0.04929066 
       197        198        199        200 
0.02820125 0.02644314 0.01896565 0.01786346 

3.7 Leverage: augment()

library(broom)
m1_aug <- augment(m1)
m1_aug
# A tibble: 200 × 15
   SysBP    ID Survive   Age AgeGroup   Sex Infection Pulse Emergency .fitted
   <int> <int>   <int> <int>    <int> <int>     <int> <int>     <int>   <dbl>
 1    80     4       0    87        3     1         1    96         1    118.
 2   142     8       1    27        1     1         1    88         1    123.
 3   112    12       1    59        2     0         0    80         1    136.
 4   100    14       1    77        3     0         0    70         0    146.
 5   128    27       0    76        3     1         1    90         1    117.
 6   142    28       1    54        2     0         1   103         1    123.
 7   110    32       1    87        3     1         1   154         1    133.
 8   110    38       1    69        2     0         1   132         1    126.
 9   104    40       1    63        2     0         0    66         0    145.
10   144    41       1    30        1     1         0   110         1    138.
# ℹ 190 more rows
# ℹ 5 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
#   .std.resid <dbl>

3.8 Histogram of leverage values

ggplot(data = m1_aug,
       aes(x = .hat)) +
  geom_histogram()

3.9 Index plot of leverage values

ggplot(data = m1_aug,
       aes(x = c(1:nrow(m1_aug)), 
           y = .hat)) +
  geom_point()

3.10 Index plot of leverage values

ggplot(data = m1_aug,
       aes(x = c(1:nrow(m1_aug)), 
           y = .hat)) +
  geom_point() +
  geom_hline(yintercept = 3*(6+1)/200, 
             color = "red",
             linetype = "dashed") +
  geom_hline(yintercept = 2*(6+1)/200,
             color = "blue") +
  geom_text(aes(label=ifelse((.hat > 2*(6+1)/200), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2)

3.11 Discrepancy

  • Cases far from their predicted values
    • Discrepancy between observed and predicted values
    • Residual = Observed - Predicted = \(Y - \hat{Y}\)
    • Externally studentized residuals
  • How far is a case from the regression line?
  • ES residual follows a \(t\) distribution with \(n - k - 2\) df
    • Cut-off: \(\pm3\) for large samples, otherwise \(\pm2\)
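
The exact benchmark follows from the \(t\) distribution, and the car package’s outlierTest() runs a Bonferroni-adjusted test of the largest ES residual (a quick sketch using m1):

library(car)
outlierTest(m1)               # Bonferroni-adjusted test of the most extreme case
qt(0.975, df = 200 - 6 - 2)   # about 1.97, the basis for the +/- 2 rule of thumb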

3.12 Raw residuals

  • Residual = \(e_i = Y_i - \hat{Y}_i\)

  • Problems

    • Metric depends on the scale of the outcome
    • An outlier hides by pulling the regression line toward itself

3.13 Hiding outlier

Code
set.seed(1)                      # make the simulated data reproducible
x <- c(1:15)
y <- 2 * x + rnorm(15, 0, 8)     # linear trend plus noise
dat <- data.frame(x, y)
dat_out <- dat
dat_out[14, 2] <- 50             # plant an outlier at x = 14
dat_del <- dat_out
dat_del[14, 2] <- NA             # same data with the outlier removed

ggplot(data = dat_out,
       aes(x = x, y = y)) +
  geom_point(size = 2) +
  geom_smooth(data = dat_out, method = "lm", se = FALSE, color = "black") +
  ylim(0, 60)

3.14 Hiding outlier

Code
ggplot(data = dat_out,
       aes(x = x, y = y)) +
  geom_point(size = 2) +
  geom_smooth(data = dat_out, method = "lm", se = FALSE, color = "black") +
  geom_smooth(data = dat_del, method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  annotate("rect", xmin = 13.5, xmax = 14.5, ymin = 46, ymax = 54, fill = "red", alpha = 0.2) +
  ylim(0, 60)

3.15 Internally studentized residual

  • “Studentized” = standardized

    • Divide raw residual by its standard deviation
    • \(SD_{e_i} = \sqrt{MS_{error}(1 - h_{ii})}\)
  • Internally studentized residual = \(\frac{e_i}{SD_{e_i}}\)

  • Problems

    • What distribution does this follow? (not a clean \(t\): \(e_i\) is also part of \(MS_{error}\))
    • Outlier pulls line toward itself, increasing \(MS_{error}\)

3.16 Externally studentized residual

  • Deleted residual: Difference between observed value and predicted value in a regression in which the point was deleted
  • Deleted residual = \(d_i = Y_i - \hat{Y}_{i(i)}\)
    • Standard deviation based on the deleted fit: \(SD_{d_i} = \sqrt{MS_{error(i)}(1 - h_{ii})}\)
  • Externally studentized residual = \(\frac{d_i}{SD_{d_i}}\)
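
Both versions are built into base R, so they are easy to compare (a quick check on m1):

head(rstandard(m1))   # internally studentized residuals
head(rstudent(m1))    # externally studentized residuals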

3.17 Discrepancy: Externally studentized residuals

  • rstudent() function in base R (the stats package)
library(dplyr)
m1_stresid <- ICU %>% mutate(esresid = rstudent(m1))
head(m1_stresid)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency    esresid
1  4       0  87        3   1         1    80    96         1 -1.2312289
2  8       1  27        1   1         1   142    88         1  0.6009965
3 12       1  59        2   0         0   112    80         1 -0.7763291
4 14       1  77        3   0         0   100    70         0 -1.4932232
5 27       0  76        3   1         1   128    90         1  0.3669971
6 28       1  54        2   0         1   142   103         1  0.6059387

3.18 Histogram of ES residuals

ggplot(data = m1_stresid,
       aes(x = esresid)) +
  geom_histogram()

3.19 Index plot of ES residuals

ggplot(data = m1_stresid,
       aes(x = c(1:nrow(m1_stresid)), 
           y = esresid)) +
  geom_point()

3.20 Index plot of ES residuals

ggplot(data = m1_stresid,
       aes(x = c(1:nrow(m1_stresid)), 
           y = esresid)) +
  geom_point() +
  geom_hline(yintercept = 2, 
             color = "blue",
             linetype = "dashed") +
  geom_hline(yintercept = -2,
             color = "blue",
             linetype = "dashed") +
  geom_text(aes(label=ifelse((esresid > 2), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2) +
  geom_text(aes(label=ifelse((esresid < -2), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2)

3.21 What’s up with ID = 921?

library(kableExtra)
m1_aug %>% filter(ID == 921) %>%
  kable()
SysBP  ID Survive Age AgeGroup Sex Infection Pulse Emergency
  256 921       0  50        2   1         0    64         1

 .fitted   .resid      .hat  .sigma   .cooksd .std.resid
126.0032 129.9968 0.0492907 30.26998 0.1311792   4.208459

A recorded SysBP of 256 against a fitted value of about 126: a very large discrepancy (.std.resid ≈ 4.2) with only modest leverage (.hat ≈ 0.05) and influence (.cooksd ≈ 0.13).

3.22 Influence

  • Changes coefficients and/or predicted values
    • Predicted values: Cook’s D and DFFITS
    • Regression coefficients: DFBETAS
  • Based on case deletion
  • How much does the case change the coefficient(s) and/or the predicted values?
  • Cutoff: Cook’s D > 1
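
All three deletion-based measures are available in base R (a sketch using m1):

head(cooks.distance(m1))   # Cook's D: overall change in predicted values
head(dffits(m1))           # DFFITS: change in the case's own predicted value
head(dfbetas(m1))          # DFBETAS: change in each coefficient, one column per term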

3.23 Influence = leverage x discrepancy

Code
ggplot(data = dat,
       aes(x = x, y = y)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "black") 

3.24 Histogram of influence values

ggplot(data = m1_aug,
       aes(x = .cooksd)) +
  geom_histogram()

3.25 Index plot of influence values

ggplot(data = m1_aug,
       aes(x = c(1:nrow(m1_aug)), 
           y = .cooksd)) +
  geom_point()

3.26 Index plot of influence values

ggplot(data = m1_aug,
       aes(x = c(1:nrow(m1_aug)), 
           y = .cooksd)) +
  geom_point() +
  geom_hline(yintercept = 1, 
             color = "red",
             linetype = "dashed") +
  geom_text(aes(label=ifelse((.cooksd > 1), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2)

3.27 Summary: Handling outliers

  • Leverage and discrepancy cut-offs will flag roughly the most extreme ~5% of cases, even when nothing is wrong
  • Don’t base decisions on a single measure
    • Look for a pattern of suspicious cases
    • Same case shows up with high leverage, high discrepancy, high influence
  • Don’t just delete cases
    • Sensitivity analysis: Same conclusions?
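
A minimal version of such a check refits the model without a flagged case and compares coefficients (a sketch; dropping ID 921 from above is illustrative):

m1_sens <- lm(data = subset(ICU, ID != 921),
              SysBP ~ . - ID - AgeGroup)
cbind(full = coef(m1), dropped = coef(m1_sens))   # same conclusions?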

3.28 Repeat offenders?

Leverage

Code
m1_aug %>% filter(.hat > 2*(6+1)/200) %>% select(ID)
# A tibble: 5 × 1
     ID
  <int>
1   202
2   333
3   584
4   732
5   740

Discrepancy

Code
m1_stresid %>% filter(esresid > 2) %>% select(ID)
   ID
1  73
2 271
3 416
4 645
5 921
Code
m1_stresid %>% filter(esresid < -2) %>% select(ID)
   ID
1  84
2 331

Influence

Code
m1_aug %>% filter(.cooksd > 1) %>% select(ID)
# A tibble: 0 × 1
# ℹ 1 variable: ID <int>

4 Next time

4.1 Some details

  • Hypothesis testing
    • Sampling distribution, degrees of freedom
  • Estimation of the model
    • \(R^2\), likelihood
  • Interaction (moderator) effects