Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Conduct logistic regression for binary outcomes
  • Interpret results in terms of probability, odds, and log-odds

2 Linear regression review

2.1 Assumptions of linear regression

The general linear model (GLM: linear regression, ANOVA) makes three assumptions about the residuals (\(e_i = Y_i - \hat{Y}_i\)) of the model:

  1. Independence: observations (i.e., residuals) from different subjects do not depend on one another
  2. Constant variance (homoscedasticity): variance of residuals is same at all values of predictor(s)
  3. Conditional normality: residuals are normally distributed at each value of predictor(s)

2.2 ICU data

library(Stat2Data)
data(ICU)
head(ICU)
  ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
1  4       0  87        3   1         1    80    96         1
2  8       1  27        1   1         1   142    88         1
3 12       1  59        2   0         0   112    80         1
4 14       1  77        3   0         0   100    70         0
5 27       0  76        3   1         1   128    90         1
6 28       1  54        2   0         1   142   103         1

2.3 Linear regression on normal outcome

Code
library(ggplot2)   # assumed to be loaded in the (hidden) slide setup
ggplot(data = ICU, 
       aes(x = Age, 
           y = Pulse)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm")

2.4 Assumptions met!

Code
library(broom)   # for augment()
m1_norm <- lm(Pulse ~ Age, 
              data = ICU)
m1_norm_resid <- augment(m1_norm)
ggplot(data = m1_norm_resid, 
       aes(x = .resid)) +
  geom_histogram(aes(y = after_stat(density)), 
                 fill = "white", 
                 color = "blue") +
  geom_density()

2.5 Assumptions met!

Code
ggplot(data = m1_norm_resid, 
       aes(sample = .resid)) + 
  geom_qq() + 
  geom_qq_line() +
  labs(x = "Theoretical", 
       y = "Sample")

2.6 Assumptions met!

Code
ggplot(data = m1_norm_resid, 
       aes(x = Age, 
           y = .resid)) + 
  geom_point() + 
  geom_smooth()

3 Binary variable

3.1 A binary variable is not normal

Code
ggplot(data = ICU, 
       aes(x = as.factor(Survive))) + 
  geom_bar()

3.2 Plot of data with fit line

Code
ggplot(data = ICU, 
       aes(x = Age, 
           y = Survive)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", 
              se = FALSE) +
  coord_cartesian(ylim = c(0,1)) 

3.3 Plot of data with fit line

Code
ggplot(data = ICU, 
       aes(x = Age, 
           y = Survive)) + 
  geom_jitter(height = 0.02, 
              alpha = 0.5) + 
  geom_smooth(method = "lm", 
              se = FALSE) +
  coord_cartesian(ylim = c(0,1)) 

3.4 Plot of residuals

Code
m1_bin <- lm(Survive ~ Age, 
             data = ICU)
resids_m1_bin <- augment(m1_bin)
ggplot(data = resids_m1_bin, 
       aes(x = .resid)) + 
  geom_histogram(aes(y = after_stat(density)), 
                 fill = "white", 
                 color = "blue") +
  geom_density() 

3.5 Plot of residuals

Code
ggplot(data = resids_m1_bin, 
       aes(sample = .resid)) + 
  geom_qq() + 
  geom_qq_line() +
  labs(x = "Theoretical", 
       y = "Sample")

3.6 Plot of residuals

Code
ggplot(data = resids_m1_bin, 
       aes(x = Age, 
           y = .resid)) + 
  geom_point() + 
  geom_smooth()

3.7 What NOT to do

  • Ignore the problem

    • Do linear regression anyway
    • Call it a “linear probability model”
  • Transform the outcome

    • Square root, natural log, etc.
    • May slightly normalize univariate residual distribution
    • Does not fix heteroscedasticity or (conditional) non-normality

3.8 A binary variable is not normal

Distribution of binary variable

3.9 What to do

  • Generalized linear model (GLiM)
    • Not a single model but a family of regression models
    • Choose features (e.g., residual distribution) to match the characteristics of your outcome variable
    • Accommodates many continuous and categorical outcome variables
    • Includes logistic regression and Poisson regression
    • Transforms the predicted outcome (via a link function), not the observed outcome

4 Logistic regression

4.1 (Binary) logistic regression

  • Outcome: binary
    • Observed value (\(Y\)): 0 or 1, where 1 = “success” or “event”
    • Predicted value (\(\hat{Y}\)): Probability of success, between 0 and 1
  • Residual/outcome distribution: binomial
  • Link function: logit (or log-odds) = \(ln\Big(\frac{\hat{Y}}{1 - \hat{Y}}\Big) = ln\left(\frac{\hat{p}}{1-\hat{p}}\right)\)

\[ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p\]
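
As a quick aside, the logit and its inverse are built into base R as qlogis() and plogis(), so you can check the link-function definition directly:

Code
p <- 0.8
log(p / (1 - p))   # logit by hand: log(4) = 1.386
qlogis(p)          # built-in logit: same value
plogis(qlogis(p))  # inverse logit recovers p: 0.8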

4.2 Reminder: normal distribution

\[f(x) = {\frac {1}{{\sqrt {2\pi \sigma^2}}}}e^{-{\frac {(x-\mu)^2 }{2 \sigma^2 }}}\]

Mean of normal distribution = \(\mu\)

Variance of normal distribution = \(\sigma^2\)

  • Mean and variance are different parameters and are unrelated
  • Homoscedasticity: constant variance that does not depend on the mean

4.3 Binomial distribution

What is the probability of \(k\) events in \(n\) trials, each with probability \(p\) of an event?

  • \(p\) = 0.5, \(n\) = 10
Code
n <- 10
p <- 0.5
k <- seq(0, 10, 1)
Prob <- dbinom(k, n, p)
bin0_5 <- data.frame(k, Prob)
ggplot(data = bin0_5, aes(x = k, y = Prob)) + 
  geom_col() +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  scale_y_continuous(breaks = seq(0, 0.4, 0.1), 
                     limits = c(0, 0.4)) +
  labs(y = "P(X = k)")

  • \(p\) = 0.1, \(n\) = 10
Code
n <- 10
p <- 0.1
k <- seq(0, 10, 1)
Prob <- dbinom(k, n, p)
bin0_1 <- data.frame(k, Prob)
ggplot(data = bin0_1, aes(x = k, y = Prob)) + 
  geom_col() +
  scale_x_continuous(breaks = seq(0, 10, 1)) +
  scale_y_continuous(breaks = seq(0, 0.4, 0.1), 
                     limits = c(0, 0.4)) +
  labs(y = "P(X = k)")

4.4 Binomial distribution

\[P(X = k) = {n \choose k} p^k (1-p)^{n-k}\]

Mean of a binomial distribution: \(np\)
Variance of a binomial distribution: \(np(1-p)\)

  • Mean and variance are related to one another

    • They are functions of the same parameters (\(n\) and \(p\))
  • Heteroscedasticity is built into logistic regression (see the sketch below)
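
A short numerical sketch of this mean-variance link: as \(p\) moves away from 0.5, the variance shrinks, so outcomes with different probabilities necessarily have different spread.

Code
n <- 10
p <- c(0.1, 0.3, 0.5, 0.7, 0.9)
data.frame(p, mean = n * p, variance = n * p * (1 - p))
# variance is largest at p = 0.5 and shrinks toward p = 0 or 1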

4.5 Logistic regression: What we model

  • Linear regression: Model the mean of the outcome
    • Observed and predicted outcome are the same type of variable
  • Logistic regression: Model the probability of a “success” or “event”
    • Observed is binary (0 or 1)
    • Predicted is probability (from 0 to 1)
    • From the probability, we can also get the odds of a success and the logit or log-odds of a success

4.6 Logistic regression model

  • glm() function in stats package (always loaded)
    • data: same as lm()
    • formula: same as lm()
    • family: Residual distribution and link function
library(dplyr)   # for %>% and mutate()
ICU <- ICU %>% mutate(AgeC = Age - mean(Age, na.rm = TRUE))
m1 <- glm(data = ICU, 
          Survive ~ AgeC, 
          family = binomial(link = "logit"))
Code
m1_b0 <- round(coefficients(m1)[1], 3)
m1_b1 <- round(coefficients(m1)[2], 3)

4.7 Logistic regression model

  • Estimate
  • Standard error
  • \(z\) statistic
  • \(p\)-value
  • No \(R^2\)
  • Deviance value
summary(m1)

Call:
glm(formula = Survive ~ AgeC, family = binomial(link = "logit"), 
    data = ICU)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.47357    0.19133   7.702 1.34e-14 ***
AgeC        -0.02754    0.01056  -2.607  0.00913 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 192.31  on 198  degrees of freedom
AIC: 196.31

Number of Fisher Scoring iterations: 4

4.8 Figure: What we model

Code
ggplot(ICU, 
       aes(x=AgeC, 
           y=Survive)) +
  geom_point() +
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE)

4.9 Three forms of logistic regression

Probability:

\[\hat{p} = \frac{e^{(\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p})}}{1+e^{(\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p})}}\]

Odds:

\[\hat{odds} = \frac{\hat{p}}{1-\hat{p}} = e^{\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p}}\]

Logit:

\[ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = \color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p}\]
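
All three forms can be pulled from the fitted model m1: predict() gives the logit with type = "link" and the probability with type = "response", and exponentiating the logit gives the odds.

Code
new_ages  <- data.frame(AgeC = c(-20, 0, 20))
logit_hat <- predict(m1, newdata = new_ages, type = "link")      # log-odds
p_hat     <- predict(m1, newdata = new_ages, type = "response")  # probability
odds_hat  <- exp(logit_hat)                                      # odds
cbind(new_ages, logit_hat, odds_hat, p_hat)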

5 Probability metric

5.1 What is probability (\(p\))?

  • Likelihood of a “success” or “event”
  • Ranges from 0 to 1 (inclusive)
  • Both options are equally likely when \(p = 0.5\)

5.2 \(\hat{p} = \frac{e^{1.474 + -0.028 X}}{1 + e^{1.474 + -0.028 X}}\)

Code
ggplot(ICU, 
       aes(x=AgeC, 
           y=Survive)) + 
  geom_point() + 
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE) 

5.3 \(\hat{p} = \frac{e^{1.474 + -0.028 X}}{1 + e^{1.474 + -0.028 X}}\)

Code
library(ggdist)
ggplot(ICU, 
       aes(x=AgeC, 
           y=Survive)) + 
  #geom_point() + 
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE) +
  stat_dots(data = ICU, 
            aes(y = Survive, 
                side = ifelse(Survive == 0, 
                              "top", 
                              "bottom")), 
            scale = 1/5)

5.4 Interpreting probability

  • Intercept: \(b_0\) is related to the probability of success when X = 0
    • \(b_0\) > 0: Success (1) more likely than failure (0) when X = 0
    • \(b_0\) < 0: Failure (0) more likely than success (1) when X = 0
  • Slope: \(b_1\) tells you how predictor X relates to probability of success
    • \(b_1\) > 0: Probability of a success increases as X increases
    • \(b_1\) < 0: Probability of a success decreases as X increases

5.5 Interpreting probability: Example

\[\hat{p} = \frac{e^{\color{OrangeRed}{1.474} + -0.028 X}}{1 + e^{\color{OrangeRed}{1.474} + -0.028 X}}\]

  • Interpretation of example intercept
    • \(b_0\) > 0: Success (1) more likely than failure (0) when X = 0
    • Probability of success when X = 0: \(\frac{e^{\color{OrangeRed}{b_0}}}{1 + e^{\color{OrangeRed}{b_0}}} = \frac{e^{\color{OrangeRed}{1.474}}}{1 + e^{\color{OrangeRed}{1.474}}} =0.814\)
    • Probability of survival at mean age = 0.814

5.6 P(success|X=0)

Code
int_eval <- round(exp(m1_b0) / (1 + exp(m1_b0)), 3)

ggplot(ICU, 
       aes(x=AgeC, 
           y=Survive)) + 
  geom_point() + 
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE) +
  annotate("segment", 
           x = -2, xend = 0, 
           y = int_eval, yend = int_eval, 
           colour = "black", linetype = "dashed") +
  annotate("segment", 
           x = 0, xend = 0, 
           y = 0, yend = int_eval, 
           colour = "black", linetype = "dashed") +
  annotate("pointrange", 
           x = 0, y = int_eval, 
           ymin = int_eval, ymax = int_eval, 
           colour = "black", size = 1.5) +
  annotate("text", 
           x = 0, y = int_eval + 0.1, 
           colour = "black", label = int_eval, size = 6)

5.7 Interpreting probability: Example

\[\hat{p} = \frac{e^{1.474 + \color{OrangeRed}{-0.028} X}}{1 + e^{1.474 + \color{OrangeRed}{-0.028} X}}\]

  • Interpretation of example slope
    • \(b_1\) < 0: Probability of a success decreases as X increases
    • Probability of survival decreases as age increases
      • But it’s non-linear

5.8 Probability metric is non-linear

  • Linear regression:
    • Constant, linear slope
    • The predicted change in \(Y\) for a 1-unit change in \(X\) depends only on the slope
  • Logistic regression (probability):
    • Non-linear slope
    • The change in \(\hat{p}\) for a 1-unit change in \(X\) depends on the value of \(X\): a conditional effect, even with no interaction term (see the sketch below)
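
As a sketch of that conditional effect, the predicted probabilities tabled on the next slide can be reproduced (up to rounding of the coefficients) with predict():

Code
predict(m1, 
        newdata = data.frame(AgeC = seq(-30, 30, 10)), 
        type = "response")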

5.9 Probability metric is non-linear

 AgeC  Predicted probability
  -30                 0.9100
  -20                 0.8843
  -10                 0.8525
    0                 0.8137
   10                 0.7675
   20                 0.7138
   30                 0.6534

5.10 Probability metric is non-linear

Code
ggplot(data = ICU, 
       aes(x=AgeC, 
           y=Survive)) + 
  geom_point() + 
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE)

5.11 Probability metric is non-linear

Code
ggplot(data = ICU, 
       aes(x=AgeC, 
           y=Survive)) + 
  geom_point() + 
  stat_smooth(method="glm", 
              method.args=list(family="binomial"), 
              se=FALSE, 
              fullrange = TRUE) +
  xlim(-100, 250)

5.12 A caution about probability equation

Warning

You might also see the probability defined as \(\hat{p} = \frac{1}{1 + e^{-({b_{0} + b_{1} X})}}\)

Or more generally, \(\hat{p} = \frac{1}{1 + e^{-(Xb)}}\)

  • These are numerically equivalent to what we’ve talked about

    • But did you notice the negative sign?
    • No? You didn’t expect it and missed it in the complicated equation?
    • Yeah, that’s why we don’t use this version
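
If you want to verify the equivalence yourself, here is a one-line check using the rounded example coefficients:

Code
b0 <- 1.474; b1 <- -0.028; x <- 10
exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))  # version used in these slides: 0.7675
1 / (1 + exp(-(b0 + b1 * x)))              # version with the negative sign: 0.7675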

6 Odds metric

6.1 What are odds?

  • Odds is the ratio of two probabilities

\[odds = \frac{\hat{p}}{(1 - \hat{p})}\]

  • Odds is the ratio of probability of a “success” (\(\hat{p}\)) to the probability of “not a success” \((1 − \hat{p})\)

6.2 How do odds work?

  • Probability ranges from 0 to 1, switches at 0.5

    • Success more likely than failure when \(p > 0.5\)
    • Success less likely than failure when \(p < 0.5\)
  • Odds range from \(0\) to \(+\infty\), switches at 1

    • Success more likely than failure when \(odds > 1\)
    • Success less likely than failure when \(odds < 1\)
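
A few values make the probability-odds correspondence concrete:

Code
p <- c(0.10, 0.25, 0.50, 0.75, 0.90)
data.frame(p, odds = p / (1 - p))
# odds < 1 when p < 0.5; odds = 1 at p = 0.5; odds > 1 when p > 0.5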

6.3 \(\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{1.474 + -0.028 X}\)

Code
ICU <- ICU %>% 
  mutate(pred_odds = exp(m1_b0 + m1_b1 * AgeC),  # predicted odds from the model
         pred_p = pred_odds / (1 + pred_odds),   # predicted probability
         obs_odds = pred_p / (1 - pred_p))       # odds recovered from the probability

ggplot(data = ICU, 
       aes(x = AgeC, 
           y = obs_odds)) +  
  geom_line(color="blue", 
            aes(x = AgeC, 
                y = pred_odds)) +
  labs(y = "Odds") +
  annotate("segment", 
           x = -40, xend = 35, 
           y = 1, yend = 1, 
           colour = "black", linetype = "dashed")

6.4 Interpreting odds

  • Intercept: \(b_0\) is related to the odds of success when \(X\) = 0
    • Odds of success when X = 0: \(e^{b_0}\)
    • \(b_0\) > 0: Odds of success > 1 when \(X\) = 0
    • \(b_0\) < 0: Odds of success < 1 when \(X\) = 0
  • Slope: \(b_1\) tells you how predictor X relates to odds of success
    • \(b_1\) > 0: Odds of success increases as \(X\) increases
    • \(b_1\) < 0: Odds of a success decreases as \(X\) increases

6.5 Interpreting odds: Example

\[\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{\color{OrangeRed}{1.474} + -0.028 X}\]

  • Interpretation of example intercept
    • \(b_0 > 0\) so odds of success > 1 when \(X\) = 0
    • Success (1) more likely than failure (0) when \(X\) = 0
    • Odds of success when \(X\) = 0: \(e^{\color{OrangeRed}{b_0}} = e^{\color{OrangeRed}{1.474}} = 4.37\)
      • A “success” is about 4.37 times as likely as a “failure”
      • Compare to 0.814 probability of success: 0.814 / 0.186 = 4.38

6.6 Interpreting odds: Example

\[\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{1.474 + \color{OrangeRed}{-0.028} X}\]

  • Interpretation of example slope
    • \(b_1\) < 0: Odds of a success decreases as \(X\) increases
    • Odds of survival decreases as age increases
      • But it’s non-linear

6.7 Odds metric is nonlinear

  • This non-linear change is presented in terms of odds ratio
    • Constant, multiplicative change in predicted odds
    • For a 1-unit difference in \(X\), the predicted odds of success is multiplied by the odds ratio
  • Example: odds ratio \(= e^{b_1}= e^{-0.028} = 0.972\)
    • For a 1-unit difference in \(X\), the predicted odds of success is multiplied by \(0.972\)
    • Each 1-unit increase in \(X\) reduces the odds of a success by about 2.8% (see the sketch below)
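
In R, odds ratios and their confidence intervals come from exponentiating the logit-metric estimates; confint() gives profile-likelihood intervals for glm objects.

Code
exp(coef(m1))     # odds at AgeC = 0 (intercept) and odds ratio per year (slope)
exp(confint(m1))  # confidence intervals in the odds (ratio) metric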

6.8 Odds ratio = 0.972

Code
ggplot(data = ICU, 
       aes(x = AgeC, 
           y = obs_odds)) +  
  geom_line(color="blue", 
            aes(x = AgeC, 
                y = pred_odds)) +
  labs(y = "Odds") +
  annotate("segment", 
           x = -40, xend = 35, 
           y = 1, yend = 1, 
           colour = "black", linetype = "dashed")

6.9 Odds metric is nonlinear

 AgeC  Predicted probability  Predicted odds
  -30                 0.9100         10.1148
  -20                 0.8843          7.6446
  -10                 0.8525          5.7777
    0                 0.8137          4.3667
   10                 0.7675          3.3003
   20                 0.7138          2.4943
   30                 0.6534          1.8851

6.10 A caution about odds

Warning

  • Odds ratios are very popular in medicine and epidemiology
  • They can be extremely misleading
  • The same odds ratio corresponds to many different probability values
    • Odds ratio \(= \frac{odds = 3}{odds = 1} = 3\)
      • Corresponds to probability of 0.75 vs 0.5
    • Odds ratio \(= \frac{odds = 9}{odds = 3} = 3\)
      • Corresponds to probability of 0.90 vs 0.75
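
Converting those odds back to probabilities makes the contrast explicit:

Code
odds <- c(1, 3, 9)
odds / (1 + odds)  # probabilities: 0.50, 0.75, 0.90
# an odds ratio of 3 links 0.50 to 0.75, but also 0.75 to 0.90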

7 Logit or log-odds metric

7.1 What is the logit?

  • Logit or log-odds is the natural log (\(ln\)) of the odds
    • As probability of “success” increases (nonlinearly, S-shaped curve)
    • The odds of “success” increases (also nonlinearly, exponentially)
    • The logit of “success” increases linearly

7.2 How does the logit work?

  • Probability ranges from 0 to 1, switches at 0.5
  • Odds range from 0 to \(+\infty\) , switches at 1
  • Logit ranges from \(-\infty\) to \(+\infty\), switches at 0
    • Success more likely than failure when logit > 0
    • Success less likely than failure when logit < 0

7.3 \(ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = 1.474 + -0.028 X\)

Code
ggplot(ICU, 
       aes(x = AgeC, 
           y = obs_odds)) +  
  geom_line(color = "blue", 
            aes(x = AgeC, 
                y = log(pred_odds))) +
  labs(y = "Logit") +
  annotate("segment", 
           x = -40, xend = 35, 
           y = 0, yend = 0, 
           colour = "black", linetype = "dashed")

7.4 Interpreting logit

  • Intercept: \(b_0\) is related to the logit of success when X = 0
    • Logit of success when X = 0: \(b_0\)
      • \(b_0\) > 0: Logit > 0 when X = 0
      • \(b_0\) < 0: Logit < 0 when X = 0
  • Slope: \(b_1\) is the relationship between predictor X and logit of success
    • \(b_1\) > 0: Logit of a success increases as X increases
    • \(b_1\) < 0: Logit of a success decreases as X increases

7.5 Interpreting logit: Example

\[\hat{logit} = ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = \color{OrangeRed}{1.474} + -0.028 X\]

  • Interpretation of example intercept
    • \(b_0\) > 0: Logit > 0 when X = 0
    • Logit of success when X = 0: \(\color{OrangeRed}{b_0} = \color{OrangeRed}{1.474}\)
    • Success more likely than failure when X = 0

7.6 Interpreting logit: Example

\[\hat{logit} = ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = 1.474 + \color{OrangeRed}{-0.028} X\]

  • Interpretation of example slope
    • \(b_1\) < 0: Logit of a success changes by \(\color{OrangeRed}{-0.028}\) (i.e., decreases by 0.028) when X increases by 1 unit

8 Miscellaneous

8.1 So which metric should I use?

  • They are equivalent and give you the same information
    • Just packaged differently
  • Use the metric that
    • Makes the most sense to you
    • You can explain fully
    • Is most commonly used in your field

8.2 Metrics: Some things to keep in mind

  • Odds ratios tell you about change, but not where you start

    • If you report odds ratios, also report some measure of probability, e.g., the probability of success at the mean of X
    • Is a 10× change from \(5\%\) to \(50\%\), or from \(0.05\%\) to \(0.5\%\)?
  • Logit is nice because it’s linear, but it’s not very interpretable

    • What is a “logit”? It’s just a mathematical concept that makes a straight line – not actually meaningful
    • But many measures don’t have meaningful metrics…

8.3 Why a \(z\)-statistic?

  • Maximum likelihood estimation
    • Estimates are assumed to be normally distributed
  • Known variance
    • Binomial distribution: mean = \(np\), variance = \(np(1-p)\)
    • Use the \(t\) distribution when you need to estimate the variance; use the \(z\) distribution when you don't

8.4 Predictors in logistic regression

  • Similar to linear regression
    • Continuous: Center for interpretability
    • Categorical: Code appropriately
    • Coefficients are partial coefficients
  • Tests are \(z\)-tests, not \(t\)-tests

8.5 Models in logistic regression

  • Add predictors in sets
    • Likelihood ratio tests to compare models
  • Coefficients are in the logit metric
    • So are the confidence intervals
    • Exponentiate to get confidence intervals in the odds metric
  • \(R^2_{multiple}\) analogue (e.g., McFadden's pseudo-\(R^2\))
  • Packages to make some of these things easier (a sketch using base R follows below)
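
Assuming, for illustration, that Infection (from the ICU data) is added as a second predictor, the steps above might look like:

Code
m2 <- glm(Survive ~ AgeC + Infection, 
          data = ICU, 
          family = binomial(link = "logit"))
anova(m1, m2, test = "Chisq")         # likelihood ratio test for the added predictor
exp(confint(m2))                      # confidence intervals in the odds metric
1 - (m2$deviance / m2$null.deviance)  # McFadden's pseudo-R-squared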

8.6 Interactions in logistic regression

  • Effects (in probability metric) are already nonlinear
    • Already conditional
  • Interactions (also conditional) are complicated here
    • Conditional on top of conditional
    • Can have interactions without a product term
    • Can have no interaction with a significant product term
  • See McCabe et al. (2022): Full reference on the References page

9 Summary

9.1 Summary

  • Use logistic regression when your outcome is binary

    • Don’t use linear regression
  • Be careful with interpretation no matter what

    • Probability: Probability makes sense, but it’s nonlinear
    • Odds: Odds ratio seems to make sense but it can be misleading
    • Logit: Linear but what even is a logit?
  • But many basic concepts parallel linear regression

    • Intercept, slope(s), adding predictors, \(R^2_{multiple}\)

9.2 In class

  • Run some logistic regression models
  • Interpret the results
  • Examples of the things we didn’t get to
    • \(R^2_{multiple}\), compare models, report full results

9.3 Next time

  • Logistic regression for binary outcomes
  • Poisson regression for count outcomes
  • Survival analysis for survival outcomes