Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Conduct logistic regression for binary outcomes
  • Interpret results in terms of probability, odds, and log-odds

2 Linear regression review

2.1 Assumptions of linear regression

The general linear model (GLM: linear regression, ANOVA) makes three assumptions about the residuals (\(e_i = Y_i - \hat{Y}_i\)) of the model:

  1. Independence: observations (i.e., residuals) from different subjects do not depend on one another
  2. Constant variance (homoscedasticity): variance of residuals is same at all values of predictor(s)
  3. Conditional normality: residuals are normally distributed at each value of predictor(s)

2.2 ICU data

library(Stat2Data)
data(ICU)
head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
1  4       0  87        3   1         1    80    96         1
2  8       1  27        1   1         1   142    88         1
3 12       1  59        2   0         0   112    80         1
4 14       1  77        3   0         0   100    70         0
5 27       0  76        3   1         1   128    90         1
6 28       1  54        2   0         1   142   103         1

2.3 Linear regression on normal outcome

Code
library(ggplot2)   # assumed to be loaded in the (hidden) slide setup
ggplot(data = ICU, 
       aes(x = Age, 
           y = Pulse)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm")

2.4 Assumptions met!

Code
library(broom)   # for augment()
m1_norm <- lm(Pulse ~ Age, 
              data = ICU)
m1_norm_resid <- augment(m1_norm)
ggplot(data = m1_norm_resid, 
       aes(x = .resid)) +
  geom_histogram(aes(y = after_stat(density)), 
                 fill = "white", 
                 color = "blue") +
  geom_density()

2.5 Assumptions met!

Code
ggplot(data = m1_norm_resid, 
       aes(sample = .resid)) + 
  geom_qq() + 
  geom_qq_line() +
  labs(x = "Theoretical", 
       y = "Sample")

2.6 Assumptions met!

Code
ggplot(data = m1_norm_resid, 
       aes(x = Age, 
           y = .resid)) + 
  geom_point() + 
  geom_smooth()

3 Binary variable

3.1 A binary variable is not normal

Code
ggplot(data = ICU, 
       aes(x = as.factor(Survive))) + 
  geom_bar()

3.2 Plot of data with fit line

Code
ggplot(data = ICU, 
       aes(x = Age, 
           y = Survive)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", 
              se = FALSE) +
  coord_cartesian(ylim = c(0,1)) 

3.3 Plot of data with fit line

Code
ggplot(data = ICU, 
       aes(x = Age, 
           y = Survive)) + 
  geom_jitter(height = 0.02, 
              alpha = 0.5) + 
  geom_smooth(method = "lm", 
              se = FALSE) +
  coord_cartesian(ylim = c(0,1)) 

3.4 Plot of residuals

Code
m1_bin <- lm(Survive ~ Age, 
             data = ICU)
resids_m1_bin <- augment(m1_bin)
ggplot(data = resids_m1_bin, 
       aes(x = .resid)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 fill = "white", 
                 color = "blue") +
  geom_density() 

3.5 Plot of residuals

Code
ggplot(data = resids_m1_bin, 
       aes(sample = .resid)) + 
  geom_qq() + 
  geom_qq_line() +
  labs(x = "Theoretical", 
       y = "Sample")

3.6 Plot of residuals

Code
ggplot(data = resids_m1_bin, 
       aes(x = Age, 
           y = .resid)) + 
  geom_point() + 
  geom_smooth()

3.7 What NOT to do

  • Ignore the problem

    • Do linear regression anyway
    • Call it a “linear probability model”
  • Transform the outcome

    • Square root, natural log, etc.
    • May slightly normalize univariate residual distribution
    • Does not fix heteroscedasticity or (conditional) non-normality

3.8 A binary variable is not normal

Distribution of binary variable

3.9 What to do

  • Generalized linear model (GLiM)
    • Not a single model but a family of regression models
    • Choose features (e.g., residual distribution) to match the characteristics of your outcome variable
    • Accommodates many continuous and categorical outcome variables
    • Includes logistic regression and Poisson regression
    • Transforms the predicted outcome (via a link function), not the observed outcome

4 Logistic regression

4.1 (Binary) logistic regression

  • Outcome: binary
    • Observed value (\(Y\)): 0 or 1, where 1 = “success” or “event”
    • Predicted value (\(\hat{Y}\)): Probability of success, between 0 and 1
  • Residual/outcome distribution: binomial
  • Link function: logit (or log-odds) = \(ln\Big(\frac{\hat{Y}}{1 - \hat{Y}}\Big) = ln\left(\frac{\hat{p}}{1-\hat{p}}\right)\)

\[ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p\]
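
As a quick aside, the logit and its inverse are built into base R as qlogis() and plogis(), so you can check the link-function definition directly:

Code
p <- 0.8
log(p / (1 - p))   # logit by hand: log(4) = 1.386
qlogis(p)          # built-in logit: same value
plogis(qlogis(p))  # inverse logit recovers p: 0.8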

4.2 Reminder: normal distribution

\[f(x) = {\frac {1}{{\sqrt {2\pi \sigma^2}}}}e^{-{\frac {(x-\mu)^2 }{2 \sigma^2 }}}\]

Mean of normal distribution = \(\mu\)

Variance of normal distribution = \(\sigma^2\)

  • Mean and variance are different parameters and are unrelated
  • Homoscedasticity: constant variance that does not depend on the mean

4.3 Binomial distribution

What is the probability of \(k\) events in \(n\) trials, each with probability \(p\) of an event?

  • \(p\) = 0.5, \(n\) = 10
Code
n <- 10
p <- 0.5
k <- seq(0, 10, 1)
Prob <- dbinom(k, n, p)
bin0_5 <- data.frame(k, Prob)
ggplot(data = bin0_5, aes(x = k, y = Prob)) + 
  geom_col() +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  scale_y_continuous(breaks = seq(0, 0.4, 0.1), 
                     limits = c(0, 0.4)) +
  labs(y = "P(X = k)")

  • \(p\) = 0.1, \(n\) = 10
Code
n <- 10
p <- 0.1
k <- seq(0, 10, 1)
Prob <- dbinom(k, n, p)
bin0_1 <- data.frame(k, Prob)
ggplot(data = bin0_1, aes(x = k, y = Prob)) + 
  geom_col() +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  scale_y_continuous(breaks = seq(0, 0.4, 0.1), 
                     limits = c(0, 0.4)) +
  labs(y = "P(X = k)")

4.4 Binomial distribution

\[P(X = k) = {n \choose k} p^k (1-p)^{n-k}\]

Mean of a binomial distribution: \(np\)
Variance of a binomial distribution: \(np(1-p)\)

  • Mean and variance are related to one another

    • They are functions of the same parameters (\(n\) and \(p\))
  • Heteroscedasticity is built into logistic regression (see the sketch below)
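
A short numerical sketch of this mean-variance link: as \(p\) moves away from 0.5, the variance shrinks, so outcomes with different probabilities necessarily have different spread.

Code
n <- 10
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)
data.frame(p, mean = n * p, variance = n * p * (1 - p))
# variance is largest at p = 0.5 and shrinks toward p = 0 or 1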

4.5 Logistic regression: What we model

  • Linear regression: Model the mean of the outcome
    • Observed and predicted outcome are the same type of variable
  • Logistic regression: Model the probability of a “success” or “event”
    • Observed is binary (0 or 1)
    • Predicted is probability (from 0 to 1)
    • From the probability, we can also get the odds of a success and the logit or log-odds of a success

4.6 Logistic regression model

  • glm() function in stats package (always loaded)
    • data: same as lm()
    • formula: same as lm()
    • family: Residual distribution and link function
library(dplyr)   # for %>% and mutate()
ICU <- ICU %>% mutate(AgeC = Age - mean(Age, na.rm = TRUE))
m1 <- glm(data = ICU, 
          Survive ~ AgeC, 
          family = binomial(link = "logit"))
Code
m1_b0 <- round(coefficients(m1)[1], 3)
m1_b1 <- round(coefficients(m1)[2], 3)

4.7 Logistic regression model

  • Estimate
  • Standard error
  • \(z\) statistic
  • \(p\)-value
  • No \(R^2\)
  • Deviance value
summary(m1)

Call:
glm(formula = Survive ~ AgeC, family = binomial(link = "logit"), 
    data = ICU)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.47357    0.19133   7.702 1.34e-14 ***
AgeC        -0.02754    0.01056  -2.607  0.00913 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 192.31  on 198  degrees of freedom
AIC: 196.31

Number of Fisher Scoring iterations: 4

4.8 Figure: What we model

Code
ggplot(ICU, 
       aes(x=AgeC, 
           y=Survive)) +
  geom_point() +
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE)

4.9 Three forms of logistic regression

Probability:

\[\hat{p} = \frac{e^{(\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p})}}{1+e^{(\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p})}}\]

Odds:

\[\hat{odds} = \frac{\hat{p}}{1-\hat{p}} = e^{\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p}}\]

Logit:

\[ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = \color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p}\]
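
All three forms can be pulled from the fitted model m1: predict() gives the logit with type = "link" and the probability with type = "response", and exponentiating the logit gives the odds.

Code
new_ages  <- data.frame(AgeC = c(-20, 0, 20))
logit_hat <- predict(m1, newdata = new_ages, type = "link")      # log-odds
p_hat     <- predict(m1, newdata = new_ages, type = "response")  # probability
odds_hat  <- exp(logit_hat)                                      # odds
cbind(new_ages, logit_hat, odds_hat, p_hat)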

5 Probability metric

5.1 What is probability (\(p\))?

  • Likelihood of a “success” or “event”
  • Ranges from 0 to 1 (inclusive)
  • Both options are equally likely when \(p = 0.5\)

5.2 \(\hat{p} = \frac{e^{1.474 + -0.028 X}}{1 + e^{1.474 + -0.028 X}}\)

Code
ggplot(ICU, 
       aes(x=AgeC, 
           y=Survive)) + 
  geom_point() + 
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE) 

5.3 \(\hat{p} = \frac{e^{1.474 + -0.028 X}}{1 + e^{1.474 + -0.028 X}}\)

Code
library(ggdist)
ggplot(ICU, 
       aes(x=AgeC, 
           y=Survive)) + 
  #geom_point() + 
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE) +
  stat_dots(data = ICU, 
            aes(y = Survive, 
                side = ifelse(Survive == 0, 
                              "top", 
                              "bottom")), 
            scale = 1/5)

5.4 Interpreting probability

  • Intercept: \(b_0\) is related to the probability of success when X = 0
    • \(b_0\) > 0: Success (1) more likely than failure (0) when X = 0
    • \(b_0\) < 0: Failure (0) more likely than success (1) when X = 0
  • Slope: \(b_1\) tells you how predictor X relates to probability of success
    • \(b_1\) > 0: Probability of a success increases as X increases
    • \(b_1\) < 0: Probability of a success decreases as X increases

5.5 Interpreting probability: Example

\[\hat{p} = \frac{e^{\color{OrangeRed}{1.474} + -0.028 X}}{1 + e^{\color{OrangeRed}{1.474} + -0.028 X}}\]

  • Interpretation of example intercept
    • \(b_0\) > 0: Success (1) more likely than failure (0) when X = 0
    • Probability of success when X = 0: \(\frac{e^{\color{OrangeRed}{b_0}}}{1 + e^{\color{OrangeRed}{b_0}}} = \frac{e^{\color{OrangeRed}{1.474}}}{1 + e^{\color{OrangeRed}{1.474}}} =0.814\)
    • Probability of survival at mean age = 0.814

5.6 P(success|X=0)

Code
int_eval <- round(exp(m1_b0) / (1 + exp(m1_b0)), 3)

ggplot(ICU, 
       aes(x=AgeC, 
           y=Survive)) + 
  geom_point() + 
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE) +
  annotate("segment", 
           x = -2, xend = 0, 
           y = int_eval, yend = int_eval, 
           colour = "black", linetype = "dashed") +
  annotate("segment", 
           x = 0, xend = 0, 
           y = 0, yend = int_eval, 
           colour = "black", linetype = "dashed") +
  annotate("pointrange", 
           x = 0, y = int_eval, 
           ymin = int_eval, ymax = int_eval, 
           colour = "black", size = 1.5) +
  annotate("text", 
           x = 0, y = int_eval + 0.1, 
           colour = "black", label = int_eval, size = 6)

5.7 Interpreting probability: Example

\[\hat{p} = \frac{e^{1.474 + \color{OrangeRed}{-0.028} X}}{1 + e^{1.474 + \color{OrangeRed}{-0.028} X}}\]

  • Interpretation of example slope
    • \(b_1\) < 0: Probability of a success decreases as X increases
    • Probability of survival decreases as age increases
      • But it’s non-linear

5.8 Probability metric is non-linear

  • Linear regression:
    • Constant, linear slope
    • The predicted change in \(Y\) for a 1-unit change in \(X\) depends only on the slope
  • Logistic regression (probability):
    • Non-linear slope
    • The change in \(\hat{p}\) for a 1-unit change in \(X\) depends on the value of \(X\): a conditional effect, even with no interaction term (see the sketch below)
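
As a sketch of that conditional effect, the predicted probabilities tabled on the next slide can be reproduced (up to rounding of the coefficients) with predict():

Code
predict(m1, 
        newdata = data.frame(AgeC = seq(-30, 30, 10)), 
        type = "response")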

5.9 Probability metric is non-linear

 AgeC  Predicted probability
  -30                 0.9100
  -20                 0.8843
  -10                 0.8525
    0                 0.8137
   10                 0.7675
   20                 0.7138
   30                 0.6534

5.10 Probability metric is non-linear

Code
ggplot(data = ICU, 
       aes(x=AgeC, 
           y=Survive)) + 
  geom_point() + 
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE)

5.11 Probability metric is non-linear

Code
ggplot(data = ICU, 
       aes(x=AgeC, 
           y=Survive)) + 
  geom_point() + 
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE, 
              fullrange = TRUE) +
  xlim(-100, 250)

5.12 A caution about probability equation

Warning

You might also see the probability defined as \(\hat{p} = \frac{1}{1 + e^{-({b_{0} + b_{1} X})}}\)

Or more generally, \(\hat{p} = \frac{1}{1 + e^{-(Xb)}}\)

  • These are numerically equivalent to what we’ve talked about

    • But did you notice the negative sign?
    • No? You didn’t expect it and missed it in the complicated equation?
    • Yeah, that’s why we don’t use this version
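
If you want to verify the equivalence yourself, here is a one-line check using the rounded example coefficients:

Code
b0 <- 1.474; b1 <- -0.028; x <- 10
exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))  # version used in these slides: 0.7675
1 / (1 + exp(-(b0 + b1 * x)))              # version with the negative sign: 0.7675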

6 Odds metric

6.1 What are odds?

  • Odds is the ratio of two probabilities

\[odds = \frac{\hat{p}}{(1 - \hat{p})}\]

  • Odds is the ratio of probability of a “success” (\(\hat{p}\)) to the probability of “not a success” \((1 − \hat{p})\)

6.2 How do odds work?

  • Probability ranges from 0 to 1, switches at 0.5

    • Success more likely than failure when \(p > 0.5\)
    • Success less likely than failure when \(p < 0.5\)
  • Odds range from \(0\) to \(+\infty\), switches at 1

    • Success more likely than failure when \(odds > 1\)
    • Success less likely than failure when \(odds < 1\)
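
A few values make the probability-odds correspondence concrete:

Code
p <- c(0.10, 0.25, 0.50, 0.75, 0.90)
data.frame(p, odds = p / (1 - p))
# odds < 1 when p < 0.5; odds = 1 at p = 0.5; odds > 1 when p > 0.5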

6.3 \(\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{1.474 + -0.028 X}\)

Code
ICU <- ICU %>% 
  mutate(pred_odds = exp(m1_b0 + m1_b1 * AgeC),  # predicted odds from the model
         pred_p = pred_odds / (1 + pred_odds),   # predicted probability
         obs_odds = pred_p / (1 - pred_p))       # odds recovered from the probability

ggplot(data = ICU, 
       aes(x = AgeC, 
           y = obs_odds)) +  
  geom_line(color="blue", 
            aes(x = AgeC, 
                y = pred_odds)) +
  labs(y = "Odds") +
  annotate("segment", 
           x = -40, xend = 35, 
           y = 1, yend = 1, 
           colour = "black", linetype = "dashed")

6.4 Interpreting odds

  • Intercept: \(b_0\) is related to the odds of success when \(X\) = 0
    • Odds of success when X = 0: \(e^{b_0}\)
    • \(b_0\) > 0: Odds of success > 1 when \(X\) = 0
    • \(b_0\) < 0: Odds of success < 1 when \(X\) = 0
  • Slope: \(b_1\) tells you how predictor X relates to odds of success
    • \(b_1\) > 0: Odds of success increases as \(X\) increases
    • \(b_1\) < 0: Odds of a success decreases as \(X\) increases

6.5 Interpreting odds: Example

\[\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{\color{OrangeRed}{1.474} + -0.028 X}\]

  • Interpretation of example intercept
    • \(b_0 > 0\) so odds of success > 1 when \(X\) = 0
    • Success (1) more likely than failure (0) when \(X\) = 0
    • Odds of success when \(X\) = 0: \(e^{\color{OrangeRed}{b_0}} = e^{\color{OrangeRed}{1.474}} = 4.37\)
      • A “success” is about 4.37 times as likely as a “failure”
      • Compare to 0.814 probability of success: 0.814 / 0.186 = 4.38

6.6 Interpreting odds: Example

\[\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{1.474 + \color{OrangeRed}{-0.028} X}\]

  • Interpretation of example slope
    • \(b_1\) < 0: Odds of a success decreases as \(X\) increases
    • Odds of survival decreases as age increases
      • But it’s non-linear

6.7 Odds metric is nonlinear

  • This non-linear change is presented in terms of odds ratio
    • Constant, multiplicative change in predicted odds
    • For a 1-unit difference in \(X\), the predicted odds of success is multiplied by the odds ratio
  • Example: odds ratio \(= e^{b_1}= e^{-0.028} = 0.972\)
    • For a 1-unit difference in \(X\), the predicted odds of success is multiplied by \(0.972\)
    • Each 1-unit increase in \(X\) reduces the odds of a success by about 2.8% (see the sketch below)
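
In R, odds ratios and their confidence intervals come from exponentiating the logit-metric estimates; confint() gives profile-likelihood intervals for glm objects.

Code
exp(coef(m1))     # odds at AgeC = 0 (intercept) and odds ratio per year (slope)
exp(confint(m1))  # confidence intervals in the odds (ratio) metric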

6.8 Odds ratio = 0.972

Code
ggplot(data = ICU, 
       aes(x = AgeC, 
           y = obs_odds)) +  
  geom_line(color="blue", 
            aes(x = AgeC, 
                y = pred_odds)) +
  labs(y = "Odds") +
  annotate("segment", 
           x = -40, xend = 35, 
           y = 1, yend = 1, 
           colour = "black", linetype = "dashed")

6.9 Odds metric is nonlinear

 AgeC  Predicted probability  Predicted odds
  -30                 0.9100         10.1148
  -20                 0.8843          7.6446
  -10                 0.8525          5.7777
    0                 0.8137          4.3667
   10                 0.7675          3.3003
   20                 0.7138          2.4943
   30                 0.6534          1.8851

6.10 A caution about odds

Warning

  • Odds ratios are very popular in medicine and epidemiology
  • They can be extremely misleading
  • The same odds ratio corresponds to many different probability values
    • Odds ratio \(= \frac{odds = 3}{odds = 1} = 3\)
      • Corresponds to probability of 0.75 vs 0.5
    • Odds ratio \(= \frac{odds = 9}{odds = 3} = 3\)
      • Corresponds to probability of 0.90 vs 0.75
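
Converting those odds back to probabilities makes the contrast explicit:

Code
odds <- c(1, 3, 9)
odds / (1 + odds)  # probabilities: 0.50, 0.75, 0.90
# an odds ratio of 3 links 0.50 to 0.75, but also 0.75 to 0.90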

7 Logit or log-odds metric

7.1 What is the logit?

  • Logit or log-odds is the natural log (\(ln\)) of the odds
    • As probability of “success” increases (nonlinearly, S-shaped curve)
    • The odds of “success” increases (also nonlinearly, exponentially)
    • The logit of “success” increases linearly

7.2 How does the logit work?

  • Probability ranges from 0 to 1, switches at 0.5
  • Odds range from 0 to \(+\infty\) , switches at 1
  • Logit ranges from \(-\infty\) to \(+\infty\), switches at 0
    • Success more likely than failure when logit > 0
    • Success less likely than failure when logit < 0

7.3 \(ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = 1.474 + -0.028 X\)

Code
ggplot(ICU, 
       aes(x = AgeC, 
           y = obs_odds)) +  
  geom_line(color = "blue", 
            aes(x = AgeC, 
                y = log(pred_odds))) +
  labs(y = "Logit") +
  annotate("segment", 
           x = -40, xend = 35, 
           y = 0, yend = 0, 
           colour = "black", linetype = "dashed")

7.4 Interpreting logit

  • Intercept: \(b_0\) is related to the logit of success when X = 0
    • Logit of success when X = 0: \(b_0\)
      • \(b_0\) > 0: Logit > 0 when X = 0
      • \(b_0\) < 0: Logit < 0 when X = 0
  • Slope: \(b_1\) is the relationship between predictor X and logit of success
    • \(b_1\) > 0: Logit of a success increases as X increases
    • \(b_1\) < 0: Logit of a success decreases as X increases

7.5 Interpreting logit: Example

\[\hat{logit} = ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = \color{OrangeRed}{1.474} + -0.028 X\]

  • Interpretation of example intercept
    • \(b_0\) > 0: Logit > 0 when X = 0
    • Logit of success when X = 0: \(\color{OrangeRed}{b_0} = \color{OrangeRed}{1.474}\)
    • Success more likely than failure when X = 0

7.6 Interpreting logit: Example

\[\hat{logit} = ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = 1.474 + \color{OrangeRed}{-0.028} X\]

  • Interpretation of example slope
    • \(b_1\) < 0: Logit of a success changes by \(\color{OrangeRed}{-0.028}\) (i.e., decreases by 0.028) when X increases by 1 unit

8 Miscellaneous

8.1 So which metric should I use?

  • They are equivalent and give you the same information
    • Just packaged differently
  • Use the metric that
    • Makes the most sense to you
    • You can explain fully
    • Is most commonly used in your field

8.2 Metrics: Some things to keep in mind

  • Odds ratios tell you about change, but not where you start

    • If you report odds ratios, also report some measure of probability, e.g., the probability of success at the mean of X
    • Is a 10× change from \(5\%\) to \(50\%\), or from \(0.05\%\) to \(0.5\%\)?
  • Logit is nice because it’s linear, but it’s not very interpretable

    • What is a “logit”? It’s just a mathematical concept that makes a straight line – not actually meaningful
    • But many measures don’t have meaningful metrics…

8.3 Why a \(z\)-statistic?

  • Maximum likelihood estimation
    • Estimates are assumed to be normally distributed
  • Known variance
    • Binomial distribution: mean = \(np\), variance = \(np(1-p)\)
    • Use the \(t\) distribution when you need to estimate the variance; use the \(z\) distribution when you don't

8.4 Predictors in logistic regression

  • Similar to linear regression
    • Continuous: Center for interpretability
    • Categorical: Code appropriately
    • Coefficients are partial coefficients
  • Tests are \(z\)-tests, not \(t\)-tests

8.5 Models in logistic regression

  • Add predictors in sets
    • Likelihood ratio tests to compare models
  • Coefficients are in the logit metric
    • So are the confidence intervals
    • Exponentiate to get confidence intervals in the odds metric
  • \(R^2_{multiple}\) analogue (e.g., McFadden's pseudo-\(R^2\))
  • Packages to make some of these things easier (a sketch using base R follows below)
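
Assuming, for illustration, that Infection (from the ICU data) is added as a second predictor, the steps above might look like:

Code
m2 <- glm(Survive ~ AgeC + Infection, 
          data = ICU, 
          family = binomial(link = "logit"))
anova(m1, m2, test = "Chisq")         # likelihood ratio test for the added predictor
exp(confint(m2))                      # confidence intervals in the odds metric
1 - (m2$deviance / m2$null.deviance)  # McFadden's pseudo-R-squared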

8.6 Interactions in logistic regression

  • Effects (in probability metric) are already nonlinear
    • Already conditional
  • Interactions (also conditional) are complicated here
    • Conditional on top of conditional
    • Can have interactions without a product term
    • Can have no interaction with a significant product term
  • See McCabe et al. (2022): Full reference on the References page

9 Summary

9.1 Summary

  • Use logistic regression when your outcome is binary

    • Don’t use linear regression
  • Be careful with interpretation no matter what

    • Probability: Probability makes sense, but it’s nonlinear
    • Odds: Odds ratio seems to make sense but it can be misleading
    • Logit: Linear but what even is a logit?
  • But many basic concepts parallel linear regression

    • Intercept, slope(s), adding predictors, \(R^2_{multiple}\)

9.2 In class

  • Run some logistic regression models
  • Interpret the results
  • Examples of the things we didn’t get to
    • \(R^2_{multiple}\), compare models, report full results

9.3 Next time

  • Logistic regression for binary outcomes
  • Poisson regression for count outcomes
  • Survival analysis for survival outcomes