Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Assess models for (multi)collinearity among predictors
  • Conduct outlier analyses to determine extreme and/or problematic cases

2 Collinearity

2.1 Collinearity

  • Sometimes called multicollinearity
  • High correlations between predictors in linear regression
  • Why does that matter?

2.2 Recall: Partial correlations

Correlation between \(X_1\) and \(Y\), ignoring the effect of \(X_2\)

2.3 Recall: Partial correlations

Correlation between \(X_1\) and \(Y\), removing the effect of \(X_2\)

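This residual-on-residual idea is easy to sketch in R with simulated data (a minimal sketch; the variables x1, x2, and y are hypothetical):

set.seed(1)
x2 <- rnorm(100)
x1 <- 0.5 * x2 + rnorm(100)             # x1 is partly driven by x2
y  <- 0.4 * x1 + 0.3 * x2 + rnorm(100)  # y is driven by both
r_x1 <- resid(lm(x1 ~ x2))              # x1 with the effect of x2 removed
r_y  <- resid(lm(y ~ x2))               # y with the effect of x2 removed
cor(r_x1, r_y)                          # partial correlation of x1 and y, given x2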

2.5 Why “collinear”?

  • Variables are vectors
  • Represent any vector as a line from the origin (0,0) to a point
  • The cosine of the angle between two (centered) vectors is their correlation
    • 180 degrees = -1
    • 90 degrees = 0
    • 0 degrees (on top of one another) = +1
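
A quick numerical check of this idea (a sketch with simulated vectors):

set.seed(2)
a <- rnorm(50)
b <- 0.6 * a + rnorm(50)
a_c <- a - mean(a)   # center both vectors
b_c <- b - mean(b)
# Cosine of the angle between the centered vectors...
sum(a_c * b_c) / sqrt(sum(a_c^2) * sum(b_c^2))
# ...equals the correlation
cor(a, b)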

2.6 Data

  • ICU data from the Stat2Data package
library(Stat2Data)
data(ICU)
colnames(ICU)
[1] "ID"        "Survive"   "Age"       "AgeGroup"  "Sex"       "Infection"
[7] "SysBP"     "Pulse"     "Emergency"

2.7 Correlations between variables

cor(ICU)
                    ID     Survive         Age    AgeGroup          Sex
ID         1.000000000  0.09008759 -0.13379591 -0.09294666  0.004507259
Survive    0.090087587  1.00000000 -0.18945786 -0.19137028 -0.020602141
Age       -0.133795910 -0.18945786  1.00000000  0.89932798  0.096077315
AgeGroup  -0.092946657 -0.19137028  0.89932798  1.00000000  0.106451213
Sex        0.004507259 -0.02060214  0.09607732  0.10645121  1.000000000
Infection -0.129037797 -0.18234920  0.15355452  0.10210375  0.022540704
SysBP      0.007579045  0.20467231  0.04259431  0.04455849  0.049428168
Pulse      0.027350603 -0.03176090  0.03736843  0.02249490  0.039529993
Emergency  0.199467075 -0.24358013 -0.18695714 -0.19765780  0.119971717
           Infection        SysBP       Pulse  Emergency
ID        -0.1290378  0.007579045  0.02735060  0.1994671
Survive   -0.1823492  0.204672308 -0.03176090 -0.2435801
Age        0.1535545  0.042594310  0.03736843 -0.1869571
AgeGroup   0.1021037  0.044558494  0.02249490 -0.1976578
Sex        0.0225407  0.049428168  0.03952999  0.1199717
Infection  1.0000000 -0.228538626  0.31051173  0.1666485
SysBP     -0.2285386  1.000000000 -0.05658246 -0.1841112
Pulse      0.3105117 -0.056582465  1.00000000  0.1752685
Emergency  0.1666485 -0.184111172  0.17526851  1.0000000

2.8 Predict SysBP from “everything”

m1 <- lm(data = ICU, SysBP ~ . - ID - AgeGroup)
summary(m1)

Call:
lm(formula = SysBP ~ . - ID - AgeGroup, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-78.827 -19.835  -2.923  19.312 129.997 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 121.72860   13.42593   9.067  < 2e-16 ***
Survive      12.94692    6.00852   2.155  0.03242 *  
Age           0.12509    0.12023   1.040  0.29944    
Sex           4.20390    4.68840   0.897  0.37102    
Infection   -13.48289    4.92376  -2.738  0.00675 ** 
Pulse         0.03095    0.08903   0.348  0.72848    
Emergency    -8.16479    5.58844  -1.461  0.14564    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.68 on 193 degrees of freedom
Multiple R-squared:  0.1036,    Adjusted R-squared:  0.07572 
F-statistic: 3.717 on 6 and 193 DF,  p-value: 0.001606

2.9 Emergency predicts SysBP alone

m2 <- lm(data = ICU, SysBP ~ Emergency)
summary(m2)

Call:
lm(formula = SysBP ~ Emergency, data = ICU)

Residuals:
    Min      1Q  Median      3Q     Max 
-92.646 -18.646  -0.502  17.426 127.354 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  142.358      4.460  31.918  < 2e-16 ***
Emergency    -13.712      5.202  -2.636  0.00906 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.47 on 198 degrees of freedom
Multiple R-squared:  0.0339,    Adjusted R-squared:  0.02902 
F-statistic: 6.947 on 1 and 198 DF,  p-value: 0.009061

2.10 Variance inflation factor (VIF)

  • VIF = \(\frac{1}{1 - R^2_j}\)
    • \(R^2_j\) is the \(R^2\) from regressing predictor \(j\) on all of the other predictors
  • Sampling variance of a regression coefficient = \(\frac{s^2}{(n - 1) \hat{var}(X_j)} \times \frac{1}{1 - R^2_j}\)
    • \(s^2\) is the variance of the residuals
    • \(n\) is the sample size
    • \(\hat{var}(X_j)\) is the estimated variance of the predictor

2.11 Variance inflation factor (VIF)

  • Cut-offs for high VIF: 5 (common) or 10
  • VIF is not the ratio of SEs for the big model vs. the small model
    • The square root of VIF is how much larger the standard error is compared to a hypothetical model in which the predictor is uncorrelated with all the other predictors
  • You might also (or instead) hear about tolerance
    • Tolerance = \(\frac{1}{VIF} = 1 - R^2_j\)

2.12 Variance inflation factor (VIF)

library(car)
vif(m1)
  Survive       Age       Sex Infection     Pulse Emergency 
 1.151100  1.152752  1.032009  1.176871  1.131408  1.212195 
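
These values can be reproduced from the definition. A sketch for Emergency, using the other predictors in m1:

# R^2 from regressing Emergency on the remaining predictors
r2_emerg <- summary(lm(data = ICU,
                       Emergency ~ Survive + Age + Sex + Infection + Pulse))$r.squared
1 / (1 - r2_emerg)   # should match vif(m1)["Emergency"], about 1.21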

2.13 Big model vs small model

  • Two things change
    • Big model has partial regression coefficients
      • “Holding all other variables constant”
      • Changes the regression coefficient: -8.16 vs -13.71
    • Variance inflated in big model due to correlated predictors
      • Changes the standard error: 5.59 vs 5.20
      • About 7.5% increase
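
Both ratios can be checked against the output above:

5.58844 / 5.202              # SE ratio, big vs. small model: about 1.074
sqrt(vif(m1)["Emergency"])   # sqrt(VIF) benchmark: about 1.10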

2.14 Highly correlated predictors

  • Age and AgeGroup are correlated 0.899
m3 <- lm(data = ICU, SysBP ~ . - ID)
library(car)
vif(m3)
  Survive       Age  AgeGroup       Sex Infection     Pulse Emergency 
 1.159103  5.356614  5.334519  1.035521  1.184188  1.131564  1.220301 

2.15 Summary: Collinearity

  • Look at correlations between predictors
    • Not the whole story
    • VIF uses the \(R^2\) from regressing each predictor on all of the others
  • Look at VIF values for the model
    • Cut-off of 5
  • Very high VIF
    • Do you need all of those predictors?
    • Can you combine some predictors?

3 Outliers

3.1 Outliers

  • Observations that
    • Are extreme on the predictor(s)
    • Fall far from their predicted values
    • Change the coefficients or predicted values

3.2 Outlier statistics

  • Model-based
    • Based on a linear regression (or other) model
  • Case-wise
    • One value per case (observation)
  • Three types of outlier statistics for linear regression
    • Leverage
    • Discrepancy
    • Influence

3.3 Leverage

  • Extreme in the predictor space
    • “Hat matrix” from the model
  • How far is a case from the centroid of the predictor space?
  • Cut-offs with \(k\) predictors
    • Large samples, many predictors: \(\frac{2(k + 1)}{n}\)
    • Smaller samples: \(\frac{3(k + 1)}{n}\)
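
For m1 above (k = 6 predictors, n = 200 cases), these cut-offs work out to:

k <- 6
n <- 200
2 * (k + 1) / n   # 0.07:  large-sample cut-off
3 * (k + 1) / n   # 0.105: smaller-sample cut-off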

3.4 Centroid of predictor space

  • If just Age and Pulse are predictors
library(ggplot2)
ggplot(data = ICU, 
       aes(x = Pulse, y = Age)) +
  geom_point(alpha = 0.3, 
             color = "blue",
             size = 4) +
  annotate("pointrange",
           x = mean(ICU$Pulse, na.rm = TRUE),
           y = mean(ICU$Age, na.rm = TRUE),
           ymin = mean(ICU$Age, na.rm = TRUE),
           ymax = mean(ICU$Age, na.rm = TRUE),
           color = "red",
           size = 2) +
  stat_ellipse(linetype = "dashed")

3.5 Leverage

  • hatvalues() in base R
  • .hat column from augment() in the broom package

3.6 Leverage: hatvalues()

hatvalues(m1)
         1          2          3          4          5          6          7 
0.04136616 0.04193692 0.02038323 0.02868237 0.03909928 0.02054581 0.04833027 
         8          9         10         11         12         13         14 
0.02676640 0.02572493 0.02918766 0.02802245 0.03910343 0.03386752 0.04735363 
        15         16         17         18         19         20         21 
0.03205395 0.02595774 0.03209545 0.02134725 0.02185566 0.04258039 0.03541913 
        22         23         24         25         26         27         28 
0.06151462 0.02853602 0.02468023 0.02064414 0.02803684 0.04139703 0.03119456 
        29         30         31         32         33         34         35 
0.02144683 0.06060417 0.01887221 0.03292118 0.02384578 0.04691193 0.03465486 
        36         37         38         39         40         41         42 
0.04347021 0.04875004 0.01691440 0.02291959 0.02774395 0.04082458 0.03992122 
        43         44         45         46         47         48         49 
0.03859821 0.02506167 0.02505285 0.03525264 0.08745421 0.04394462 0.02320841 
        50         51         52         53         54         55         56 
0.06811084 0.02658598 0.03343375 0.03451670 0.04765859 0.03440951 0.04627143 
        57         58         59         60         61         62         63 
0.05858381 0.03742786 0.03961405 0.02717904 0.01852133 0.02332432 0.05111651 
        64         65         66         67         68         69         70 
0.04346598 0.02468261 0.03152511 0.02466933 0.04996516 0.04636771 0.02213098 
        71         72         73         74         75         76         77 
0.03373205 0.04023063 0.02478997 0.03386846 0.02579042 0.04942887 0.03371832 
        78         79         80         81         82         83         84 
0.02076133 0.04039878 0.08388916 0.02664010 0.02037631 0.04253115 0.03320655 
        85         86         87         88         89         90         91 
0.03446771 0.03413661 0.03105504 0.02060281 0.02100419 0.03725994 0.01970532 
        92         93         94         95         96         97         98 
0.03137055 0.06045898 0.02893790 0.04554696 0.03883473 0.04452235 0.02419923 
        99        100        101        102        103        104        105 
0.04103317 0.04157161 0.04263943 0.02834146 0.03718586 0.05326585 0.04575464 
       106        107        108        109        110        111        112 
0.04134128 0.04924349 0.02317801 0.03378472 0.02522227 0.04353502 0.04001761 
       113        114        115        116        117        118        119 
0.03365728 0.02431396 0.03067899 0.02419923 0.02889805 0.02072236 0.04452363 
       120        121        122        123        124        125        126 
0.02657169 0.03704621 0.03349834 0.05330341 0.03367753 0.03904766 0.04777942 
       127        128        129        130        131        132        133 
0.02935877 0.03577744 0.03885206 0.02773097 0.07196314 0.05090712 0.02046498 
       134        135        136        137        138        139        140 
0.03231415 0.02599930 0.03545566 0.04728562 0.03446484 0.03948813 0.02513356 
       141        142        143        144        145        146        147 
0.03918547 0.02311702 0.02587308 0.02725079 0.01829482 0.02781909 0.05032186 
       148        149        150        151        152        153        154 
0.04081059 0.03249827 0.03569586 0.04139041 0.02789797 0.04382672 0.05040804 
       155        156        157        158        159        160        161 
0.03172093 0.01894094 0.03222629 0.03516660 0.07547144 0.07788693 0.03140165 
       162        163        164        165        166        167        168 
0.01823211 0.03893800 0.03294626 0.02311711 0.03260965 0.04524500 0.03446130 
       169        170        171        172        173        174        175 
0.02021935 0.03541123 0.03357409 0.01683831 0.02664340 0.03274660 0.02132949 
       176        177        178        179        180        181        182 
0.02242424 0.02664338 0.02779856 0.05677280 0.03169522 0.05465986 0.03619239 
       183        184        185        186        187        188        189 
0.03233791 0.03169522 0.02125515 0.05362754 0.02740321 0.03917655 0.02877631 
       190        191        192        193        194        195        196 
0.04178583 0.02413103 0.02456579 0.04330927 0.02560642 0.02355296 0.04929066 
       197        198        199        200 
0.02820125 0.02644314 0.01896565 0.01786346 

3.7 Leverage: augment()

library(broom)
m1_aug <- augment(m1)
m1_aug
# A tibble: 200 × 15
   SysBP    ID Survive   Age AgeGroup   Sex Infection Pulse Emergency .fitted
   <int> <int>   <int> <int>    <int> <int>     <int> <int>     <int>   <dbl>
 1    80     4       0    87        3     1         1    96         1    118.
 2   142     8       1    27        1     1         1    88         1    123.
 3   112    12       1    59        2     0         0    80         1    136.
 4   100    14       1    77        3     0         0    70         0    146.
 5   128    27       0    76        3     1         1    90         1    117.
 6   142    28       1    54        2     0         1   103         1    123.
 7   110    32       1    87        3     1         1   154         1    133.
 8   110    38       1    69        2     0         1   132         1    126.
 9   104    40       1    63        2     0         0    66         0    145.
10   144    41       1    30        1     1         0   110         1    138.
# ℹ 190 more rows
# ℹ 5 more variables: .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
#   .std.resid <dbl>

3.8 Histogram of leverage values

ggplot(data = m1_aug,
       aes(x = .hat)) +
  geom_histogram()

3.9 Index plot of leverage values

ggplot(data = m1_aug,
       aes(x = c(1:nrow(m1_aug)), 
           y = .hat)) +
  geom_point()

3.10 Index plot of leverage values

ggplot(data = m1_aug,
       aes(x = c(1:nrow(m1_aug)), 
           y = .hat)) +
  geom_point() +
  geom_hline(yintercept = 3*(6+1)/200, 
             color = "red",
             linetype = "dashed") +
  geom_hline(yintercept = 2*(6+1)/200,
             color = "blue") +
  geom_text(aes(label=ifelse((.hat > 2*(6+1)/200), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2)

3.11 Discrepancy

  • Cases far from their predicted values
    • Discrepancy between observed and predicted values
    • Residual = Observed - Predicted = \(Y - \hat{Y}\)
    • Externally studentized residuals
  • How far is a case from the regression line?
  • ES residual follows a \(t\) distribution with \(n - k - 2\) df
    • Cut-off: \(\pm3\) for large samples, otherwise \(\pm2\)
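
The exact benchmark follows from the \(t\) distribution, and the car package’s outlierTest() runs a Bonferroni-adjusted test of the largest ES residual (a quick sketch using m1):

library(car)
outlierTest(m1)               # Bonferroni-adjusted test of the most extreme case
qt(0.975, df = 200 - 6 - 2)   # about 1.97, the basis for the +/- 2 rule of thumb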

3.12 Raw residuals

  • Residual = \(e_i = Y_i - \hat{Y}_i\)

  • Problems

    • Metric depends on the scale of the outcome
    • An outlier hides by pulling the regression line toward itself

3.13 Hiding outlier

Code
set.seed(1)                      # make the simulated data reproducible
x <- c(1:15)
y <- 2 * x + rnorm(15, 0, 8)     # linear trend plus noise
dat <- data.frame(x, y)
dat_out <- dat
dat_out[14, 2] <- 50             # plant an outlier at x = 14
dat_del <- dat_out
dat_del[14, 2] <- NA             # same data with the outlier removed

ggplot(data = dat_out,
       aes(x = x, y = y)) +
  geom_point(size = 2) +
  geom_smooth(data = dat_out, method = "lm", se = FALSE, color = "black") +
  ylim(0, 60)

3.14 Hiding outlier

Code
ggplot(data = dat_out,
       aes(x = x, y = y)) +
  geom_point(size = 2) +
  geom_smooth(data = dat_out, method = "lm", se = FALSE, color = "black") +
  geom_smooth(data = dat_del, method = "lm", se = FALSE, color = "red", linetype = "dashed") +
  annotate("rect", xmin = 13.5, xmax = 14.5, ymin = 46, ymax = 54, fill = "red", alpha = 0.2) +
  ylim(0, 60)

3.15 Internally studentized residual

  • “Studentized” = standardized

    • Divide raw residual by its standard deviation
    • \(SD_{e_i} = \sqrt{MS_{error}(1 - h_{ii})}\)
  • Internally studentized residual = \(\frac{e_i}{SD_{e_i}}\)

  • Problems

    • What distribution does this follow? (not a clean \(t\): \(e_i\) is also part of \(MS_{error}\))
    • Outlier pulls line toward itself, increasing \(MS_{error}\)

3.16 Externally studentized residual

  • Deleted residual: Difference between observed value and predicted value in a regression in which the point was deleted
  • Deleted residual = \(d_i = Y_i - \hat{Y}_{i(i)}\)
    • Standard deviation based on the deleted fit: \(SD_{d_i} = \sqrt{MS_{error(i)}(1 - h_{ii})}\)
  • Externally studentized residual = \(\frac{d_i}{SD_{d_i}}\)
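
Both versions are built into base R, so they are easy to compare (a quick check on m1):

head(rstandard(m1))   # internally studentized residuals
head(rstudent(m1))    # externally studentized residuals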

3.17 Discrepancy: Externally studentized residuals

  • rstudent() function in base R (the stats package)
library(dplyr)
m1_stresid <- ICU %>% mutate(esresid = rstudent(m1))
head(m1_stresid)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency    esresid
1  4       0  87        3   1         1    80    96         1 -1.2312289
2  8       1  27        1   1         1   142    88         1  0.6009965
3 12       1  59        2   0         0   112    80         1 -0.7763291
4 14       1  77        3   0         0   100    70         0 -1.4932232
5 27       0  76        3   1         1   128    90         1  0.3669971
6 28       1  54        2   0         1   142   103         1  0.6059387

3.18 Histogram of ES residuals

ggplot(data = m1_stresid,
       aes(x = esresid)) +
  geom_histogram()

3.19 Index plot of ES residuals

ggplot(data = m1_stresid,
       aes(x = c(1:nrow(m1_stresid)), 
           y = esresid)) +
  geom_point()

3.20 Index plot of ES residuals

ggplot(data = m1_stresid,
       aes(x = c(1:nrow(m1_stresid)), 
           y = esresid)) +
  geom_point() +
  geom_hline(yintercept = 2, 
             color = "blue",
             linetype = "dashed") +
  geom_hline(yintercept = -2,
             color = "blue",
             linetype = "dashed") +
  geom_text(aes(label=ifelse((esresid > 2), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2) +
  geom_text(aes(label=ifelse((esresid < -2), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2)

3.21 What’s up with ID = 921?

library(kableExtra)
m1_aug %>% filter(ID == 921) %>%
  kable()
SysBP  ID Survive Age AgeGroup Sex Infection Pulse Emergency
  256 921       0  50        2   1         0    64         1

 .fitted   .resid      .hat  .sigma   .cooksd .std.resid
126.0032 129.9968 0.0492907 30.26998 0.1311792   4.208459

A recorded SysBP of 256 against a fitted value of about 126: a very large discrepancy (.std.resid ≈ 4.2) with only modest leverage (.hat ≈ 0.05) and influence (.cooksd ≈ 0.13).

3.22 Influence

  • Changes coefficients and/or predicted values
    • Predicted values: Cook’s D and DFFITS
    • Regression coefficients: DFBETAS
  • Based on case deletion
  • How much does the case change the coefficient(s) and/or the predicted values?
  • Cutoff: Cook’s D > 1
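
All three deletion-based measures are available in base R (a sketch using m1):

head(cooks.distance(m1))   # Cook's D: overall change in predicted values
head(dffits(m1))           # DFFITS: change in the case's own predicted value
head(dfbetas(m1))          # DFBETAS: change in each coefficient, one column per term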

3.23 Influence = leverage x discrepancy

Code
ggplot(data = dat,
       aes(x = x, y = y)) +
  geom_point(size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "black") 

3.24 Histogram of influence values

ggplot(data = m1_aug,
       aes(x = .cooksd)) +
  geom_histogram()

3.25 Index plot of influence values

ggplot(data = m1_aug,
       aes(x = c(1:nrow(m1_aug)), 
           y = .cooksd)) +
  geom_point()

3.26 Index plot of influence values

ggplot(data = m1_aug,
       aes(x = c(1:nrow(m1_aug)), 
           y = .cooksd)) +
  geom_point() +
  geom_hline(yintercept = 1, 
             color = "red",
             linetype = "dashed") +
  geom_text(aes(label=ifelse((.cooksd > 1), 
                             as.character(ID), 
                             '')),
            hjust = 0, nudge_x = 2)

3.27 Summary: Handling outliers

  • Leverage and discrepancy cut-offs will flag roughly the most extreme ~5% of cases, even when nothing is wrong
  • Don’t base decisions on a single measure
    • Look for a pattern of suspicious cases
    • Same case shows up with high leverage, high discrepancy, high influence
  • Don’t just delete cases
    • Sensitivity analysis: Same conclusions?
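
A minimal version of such a check refits the model without a flagged case and compares coefficients (a sketch; dropping ID 921 from above is illustrative):

m1_sens <- lm(data = subset(ICU, ID != 921),
              SysBP ~ . - ID - AgeGroup)
cbind(full = coef(m1), dropped = coef(m1_sens))   # same conclusions?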

3.28 Repeat offenders?

Leverage

Code
m1_aug %>% filter(.hat > 2*(6+1)/200) %>% select(ID)
# A tibble: 5 × 1
     ID
  <int>
1   202
2   333
3   584
4   732
5   740

Discrepancy

Code
m1_stresid %>% filter(esresid > 2) %>% select(ID)
   ID
1  73
2 271
3 416
4 645
5 921
Code
m1_stresid %>% filter(esresid < -2) %>% select(ID)
   ID
1  84
2 331

Influence

Code
m1_aug %>% filter(.cooksd > 1) %>% select(ID)
# A tibble: 0 × 1
# ℹ 1 variable: ID <int>

4 Next time

4.1 Some details

  • Hypothesis testing
    • Sampling distribution, degrees of freedom
  • Estimation of the model
    • \(R^2\), likelihood
  • Interaction (moderator) effects