Multivariate: Logistic regression

1 Goals

1.1 Goals

1.1.1 Goals of this lecture

  • My outcome variable isn’t normally distributed

    • It’s binary!!!

    • Two mutually exclusive categories

      • yes/no, pass/fail, diagnosed/not, etc.
    • Linear regression assumptions are violated

  • Use logistic regression to analyze the outcome

    • It’s an extension of linear regression, so many of the same concepts still apply

2 Linear regression and extensions

2.1 Review: Linear regression

2.1.1 Assumptions of linear regression

General linear model (GLM, linear regression, ANOVA) makes three assumptions about the residuals (\(e_i = Y_i - \hat{Y}_i\)) of the model

  1. Independence: observations (i.e., residuals) from different subjects do not depend on one another
  2. Constant variance (homoscedasticity): variance of residuals is same at all values of predictor(s)
  3. Conditional normality: residuals are normally distributed at each value of predictor(s)

2.1.2 Linear regression on normal outcome

Relationship between x and y with a linear fit

2.1.3 Assumptions met!

Histogram of the residuals

2.1.4 Assumptions met!

QQ plot of residuals to demonstrate normality

2.1.5 Assumptions met!

Residuals vs predictor with loess line to show no relation

2.2 Linear regression with a binary variable

2.2.1 A binary variable is not normal

Distribution of binary variable

2.2.2 Plot of data with fit line

Binary outcome vs predictor with linear fit

2.2.3 Plot of data with fit line

Binary outcome vs predictor with linear fit

2.2.4 Plot of residuals

Histogram of the residuals

2.2.5 Plot of residuals

QQ plot of residuals to demonstrate non-normality

2.2.6 Plot of residuals

Residuals vs predictor with loess line

2.3 Next steps

2.3.1 What NOT to do

  • Ignore the problem

    • Do linear regression anyway
    • Call it a “linear probability model”
  • Transform the outcome

    • Square root, natural log, etc.
    • May slightly normalize the univariate residual distribution
    • Does not fix heteroscedasticity or (conditional) non-normality

2.3.2 A binary variable is not normal

Distribution of binary variable

2.3.3 What to do

The generalized linear model (GLiM)

  • Not a single model but a family of regression models
  • Choose features (e.g., residual distribution) to match the characteristics of your outcome variable
  • Accommodates many continuous and categorical outcome variables
  • Includes logistic regression and Poisson regression

3 Logistic regression

3.1 Logistic regression

3.1.1 (Binary) logistic regression

  • Outcome: binary
    • Observed value (\(Y\)): 0 or 1, where 1 = “success” or “event”
    • Predicted value (\(\hat{Y}\)): Probability of success, between 0 and 1
  • Residual distribution: binomial
  • Link function: logit (or log-odds) = \(ln\Big(\frac{\hat{Y}}{1 - \hat{Y}}\Big)\)

\[ln\left(\frac{\hat{Y}}{1-\hat{Y}}\right) = ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p\]
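
As a concrete illustration, here is a minimal sketch of fitting a binary logistic regression in R on simulated data. The data frame dat, its variables, and the coefficient values are hypothetical, chosen only for illustration; later sketches in this lecture reuse these dat and fit objects.

    # Simulate a binary outcome whose log-odds are linear in x (hypothetical example)
    # plogis() converts a logit to a probability
    set.seed(1)
    dat <- data.frame(x = rnorm(100))
    dat$y <- rbinom(100, size = 1, prob = plogis(0.25 + 1.2 * dat$x))

    # glm() with family = binomial(link = "logit") fits the logistic regression
    fit <- glm(y ~ x, data = dat, family = binomial(link = "logit"))
    coef(fit)   # b0 and b1, reported in the logit (log-odds) metric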

3.1.2 Reminder: normal distribution

\[f(x) = {\frac {1}{{\sqrt {2\pi \sigma^2}}}}e^{-{\frac {(x-\mu)^2 }{2 \sigma^2 }}}\]

Mean of normal distribution = \(\mu\)

Variance of normal distribution = \(\sigma^2\)

  • Mean and variance are different parameters and are unrelated

3.1.3 Binomial distribution

\[P(X = k) = {n \choose k} p^k (1-p)^{n-k}\]

  • \(n\) is the sample size
  • \(p\) is the probability of an event
  • \(k\) is the observed number of events
  • \({n \choose k} = \frac{n!}{k!(n-k)!}\) and is read as “\(n\) choose \(k\)”

3.1.4 Binomial distribution

What is the probability of having \(k\) events in \(n\) trials, each of which has probability \(p\) of being an “event”?

  • \(p\) = 0.5, \(n\) = 10

  • \(p\) = 0.1, \(n\) = 10
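
These probabilities can be computed with R's dbinom(), or by hand from the formula above; a quick sketch:

    # P(X = k) for k = 0, ..., 10 under the two settings above
    k <- 0:10
    round(dbinom(k, size = 10, prob = 0.5), 3)   # p = 0.5, n = 10
    round(dbinom(k, size = 10, prob = 0.1), 3)   # p = 0.1, n = 10

    # The same quantity from the formula, for one value of k
    choose(10, 3) * 0.5^3 * (1 - 0.5)^(10 - 3)   # equals dbinom(3, 10, 0.5)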

3.1.5 Binomial distribution

\[P(X = k) = {n \choose k} p^k (1-p)^{n-k}\]

Mean of a binomial distribution: \(np\)
Variance of a binomial distribution: \(np(1-p)\)

  • Mean and variance are related to one another

    • They are functions of the same parameters (\(n\) and \(p\))
  • Heteroscedasticity is built into logistic regression
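
A quick simulation check of these two formulas (the values of n and p here are arbitrary):

    # Mean and variance of a binomial are np and np(1 - p)
    n <- 10; p <- 0.3
    x <- rbinom(1e5, size = n, prob = p)
    c(mean(x), n * p)             # both near 3
    c(var(x),  n * p * (1 - p))   # both near 2.1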

3.1.6 Logistic regression: What we model

  • Linear regression: Model the mean of the outcome (conditional on predictor(s))

  • Logistic regression: Model the probability of a “success” or “event” (conditional on predictor(s))

    • From the probability, we can also get the odds of a success and the logit or log-odds of a success

3.1.7 Figure: What we model

3.1.8 Three forms of logistic regression

Probability:

\[\hat{p} = \frac{e^{(\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p})}}{1+e^{(\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p})}}\]

Odds:

\[\hat{odds} = \frac{\hat{p}}{1-\hat{p}} = e^{\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p}}\]

Logit:

\[ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = \color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p}\]
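
The three forms are one-to-one transformations of each other. A small sketch using the example coefficients that appear throughout this lecture (0.251 and 1.219), evaluated at an arbitrary value X = 1:

    b0 <- 0.251; b1 <- 1.219; x <- 1

    logit <- b0 + b1 * x         # log-odds
    odds  <- exp(logit)          # odds = e^(b0 + b1*x)
    p     <- odds / (1 + odds)   # probability
    c(logit = logit, odds = odds, p = p)   # 1.47, 4.35, 0.81

    # plogis() and qlogis() convert between the probability and logit metrics directly
    plogis(logit)   # same as p
    qlogis(p)       # same as logit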

3.2 Probability metric

3.2.1 What is probability (\(p\))?

  • Likelihood of a “success” or “event”
  • Ranges from 0 to 1
  • Both options are equally likely when \(p = 0.5\)

3.2.2 \(\hat{p} = \frac{e^{0.251 + 1.219 X}}{1 + e^{0.251 + 1.219 X}}\)

3.2.3 \(\hat{p} = \frac{e^{0.251 + 1.219 X}}{1 + e^{0.251 + 1.219 X}}\)

3.2.4 Probability metric interpretation: General

\[\hat{p} = \frac{e^{0.251 + 1.219 X}}{1 + e^{0.251 + 1.219 X}}\]

General interpretation of intercept:

\(b_0\) is related to the probability of success when X = 0

  • \(b_0\) > 0: Success (1) more likely than failure (0) when X = 0
  • \(b_0\) < 0: Failure (0) more likely than success (1) when X = 0

3.2.5 Probability metric interpretation: General

\[\hat{p} = \frac{e^{0.251 + 1.219 X}}{1 + e^{0.251 + 1.219 X}}\]

General interpretation of slope:

\(b_1\) tells you how predictor X relates to probability of success

  • \(b_1\) > 0: Probability of a success increases as X increases
  • \(b_1\) < 0: Probability of a success decreases as X increases

3.2.6 Probability metric interpretation: Example

\[\hat{p} = \frac{e^{\color{OrangeRed}{0.251} + 1.219 X}}{1 + e^{\color{OrangeRed}{0.251} + 1.219 X}}\]

Interpretation of example intercept:

  • \(b_0\) > 0: Success (1) more likely than failure (0) when X = 0
  • Probability of success when X = 0:

\(\frac{e^{\color{OrangeRed}{b_0}}}{1 + e^{\color{OrangeRed}{b_0}}} = \frac{e^{\color{OrangeRed}{0.251}}}{1 + e^{\color{OrangeRed}{0.251}}} =0.562\)
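
The same number in R, two equivalent ways:

    b0 <- 0.251
    exp(b0) / (1 + exp(b0))   # 0.562
    plogis(b0)                # same: predicted probability of success at X = 0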

3.2.7 Probability metric interpretation: Example

\[\hat{p} = \frac{e^{0.251 + \color{OrangeRed}{1.219} X}}{1 + e^{0.251 + \color{OrangeRed}{1.219} X}}\]

Interpretation of example slope:

  • \(b_1\) > 0: Probability of a success increases as X increases

3.2.8 P(success|X=0)

3.2.9 Probability metric interpretation: Non-linear

  • Linear regression:

    • Constant, linear slope
    • The slope is the same at every value of X (it depends only on \(b_1\))
  • Logistic regression (probability):

    • Non-linear slope
    • Slope depends on BOTH slope (\(b_1\)) and predicted probability (\(\hat{p}\))
      • The slope of the tangent to the regression line at the predicted outcome value = \(\hat{p} (1-\hat{p}) b_1\)

3.2.10 Probability metric interpretation: Non-linear

When \(\color{blue}{X = 1.5}\):

\[\hat{P}(success) = \hat{p} = \frac{e^{b_0 + b_1 \color{blue}{X}}}{1+e^{b_0 + b_1 \color{blue}{X}}} = \frac{e^{0.251 + 1.219 \times \color{blue}{1.5}}}{1 + e^{0.251 + 1.219 \times \color{blue}{1.5}}} = 0.889\]

Approximate slope at that point is

\[\hat{p} (1-\hat{p}) \color{OrangeRed}{b_1} = 0.889 \times (1 - 0.889) \times \color{OrangeRed}{1.219} = 0.12\]

3.2.11 Probability metric interpretation: Non-linear

X value   Predicted probability   Slope
   -3              0.03            0.04
   -2              0.10            0.11
   -1              0.28            0.24
    0              0.56            0.30
    1              0.81            0.19
    2              0.94            0.07
    3              0.98            0.02
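
The worked example and the table above can be reproduced with a few lines of R (same example coefficients):

    b0 <- 0.251; b1 <- 1.219

    # Tangent slope at X = 1.5 (the worked example on the previous slide)
    p15 <- plogis(b0 + b1 * 1.5)   # 0.889
    p15 * (1 - p15) * b1           # about 0.12

    # Predicted probability and tangent slope across a grid of X values
    x <- -3:3
    p <- plogis(b0 + b1 * x)
    data.frame(x, probability = round(p, 2), slope = round(p * (1 - p) * b1, 2))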

3.2.12 A caution about probability equation

Warning

You might also see the probability defined as \(\hat{p} = \frac{1}{1 + e^{-({b_{0} + b_{1} X})}}\)

Or more generally, \(\hat{p} = \frac{1}{1 + e^{-(Xb)}}\)

  • These are numerically equivalent to what we’ve talked about

    • But did you notice the negative sign?
    • No? You didn’t expect it and missed it in the complicated equation?
    • Yeah, that’s why we don’t use this version
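
A two-line check that the two forms give identical numbers (X = 1.5 chosen arbitrarily):

    eta <- 0.251 + 1.219 * 1.5   # the linear predictor at X = 1.5
    exp(eta) / (1 + exp(eta))    # form used in these slides
    1 / (1 + exp(-eta))          # form with the negative sign; same value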

3.3 Odds metric

3.3.1 What are odds?

Odds is the ratio of two probabilities

  • Model the probability of a “success”
  • Odds is the ratio of probability of a “success” (\(\hat{p}\)) to the probability of “not a success” \((1 − \hat{p})\)

\[odds = \frac{\hat{p}}{(1 - \hat{p})}\]

As probability of “success” increases (nonlinearly), the odds of “success” increases (also nonlinearly, but in a different way)

3.3.2 How do odds work?

  • Probability ranges from 0 to 1, switches at 0.5

    • Success more likely than failure when \(p > 0.5\)
    • Success less likely than failure when \(p < 0.5\)
  • Odds range from \(0\) to \(+\infty\), switches at 1

    • Success more likely than failure when \(odds > 1\)
    • Success less likely than failure when \(odds < 1\)

3.3.3 \(\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{0.251 + 1.219 X}\)

3.3.4 Odds metric interpretation: General

\[\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{0.251 + 1.219 X}\]

General interpretation of intercept:
\(b_0\) is related to the odds of success when \(X\) = 0

  • Odds of success when X = 0: \(e^{b_0}\)
  • \(b_0\) > 0: Odds of success > 1 when \(X\) = 0
  • \(b_0\) < 0: Odds of success < 1 when \(X\) = 0

3.3.5 Odds metric interpretation: General

\[\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{0.251 + 1.219 X}\]

General interpretation of slope:
\(b_1\) = relationship between predictor \(X\) and the odds of success

  • \(b_1\) > 0: Odds of success increases as \(X\) increases
  • \(b_1\) < 0: Odds of a success decreases as \(X\) increases

3.3.6 Odds metric interpretation: Example

\[\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{\color{OrangeRed}{0.251} + 1.219 X}\]

Interpretation of example intercept:

  • \(b_0 > 0\): Odds of success > 1 when \(X\) = 0
    • Success (1) more likely than failure (0) when \(X\) = 0
  • Odds of success when \(X\) = 0: \(e^{\color{OrangeRed}{b_0}} = e^{\color{OrangeRed}{0.251}} = 1.29\)
    • A “success” is about 1.29 times as likely as a “failure”
    • Compare to the 0.562 probability of success: 0.562 / 0.438 = 1.28 (same value, up to rounding)

3.3.7 Odds metric interpretation: Example

\[\hat{odds} = \frac{\hat{p}}{(1 - \hat{p})} = e^{0.251 + \color{OrangeRed}{1.219} X}\]

Interpretation of example slope:

\(b_1\) > 0: Odds of a success increases as \(X\) increases

3.3.8 Odds metric interpretation: Non-linear

3.3.9 Odds metric interpretation: Non-linear

  • This non-linear change is presented in terms of odds ratio

    • Constant, multiplicative change in predicted odds
    • For a 1-unit difference in \(X\), the predicted odds of success is multiplied by the odds ratio
  • Example: odds ratio \(= e^{b_1}= e^{1.219} = 3.38\)

    • For a 1-unit difference in \(X\), the predicted odds of success is multiplied by \(3.38\)

3.3.10 Odds metric interpretation: Non-linear

  • Odds ratio \(= e^{b_1}= e^{1.219} = 3.38\)

  • Odds ratio for \(X\) = 1 versus \(X\) = 0 : \(\frac{odds(X = 1)}{odds(X = 0)} = \frac{4.349}{1.285} = 3.38\)

    • Odds of success is 3.38 times larger when \(X\) = 1 vs \(X\) = 0
  • Odds ratio for \(X\) = 2 versus \(X\) = 1 : \(\frac{odds(X = 2)}{odds(X = 1)} = \frac{14.717}{4.349} = 3.38\)

    • Odds of success is 3.38 times larger when \(X\) = 2 vs \(X\) = 1
  • In fact, this holds for ANY 1-unit difference in \(X\) (see the sketch below)

  • Constant multiplicative change
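
A short check of this constant multiplicative change, using the example coefficients:

    b0 <- 0.251; b1 <- 1.219
    odds <- exp(b0 + b1 * (-3:3))      # predicted odds at X = -3, -2, ..., 3
    odds[-1] / odds[-length(odds)]     # every consecutive ratio equals exp(b1) = 3.38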

3.3.11 Odds metric figure again (odds ratio = 3.38)

3.3.12 Odds metric interpretation: Non-linear

X value   Predicted probability   Predicted odds
   -3              0.03                0.03
   -2              0.10                0.11
   -1              0.28                0.38
    0              0.56                1.29
    1              0.81                4.35
    2              0.94               14.72
    3              0.98               49.80

3.3.13 A caution about odds

Warning

  • Odds ratios are very popular in medicine and epidemiology
  • They can be extremely misleading
  • The same odds ratio corresponds to many different probability values
    • Odds ratio \(= \frac{odds = 3}{odds = 1} = 3\)
      • Corresponds to probability of 0.75 vs 0.5
    • Odds ratio \(= \frac{odds = 9}{odds = 3} = 3\)
      • Corresponds to probability of 0.90 vs 0.75

3.4 Logit or log-odds metric

3.4.1 What is the logit?

Logit or log-odds is the natural log (\(ln\)) of the odds

  • As probability of “success” increases (nonlinearly, S-shaped curve)

    • The odds of “success” increases (also nonlinearly, exponentially up)
    • The logit of “success” increases linearly

3.4.2 How does the logit work?

  • Probability ranges from 0 to 1, switches at 0.5

  • Odds range from 0 to \(+\infty\) , switches at 1

  • Logit ranges from \(-\infty\) to \(+\infty\), switches at 0

    • Success more likely than failure when logit > 0
    • Success less likely than failure when logit < 0

3.4.3 \(\hat{logit} = ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = 0.251 + 1.219 X\)

3.4.4 Logit metric interpretation: General

\[\hat{logit} = ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = 0.251 + 1.219 X\]

General interpretation of intercept:
\(b_0\) is related to the logit of success when X = 0

  • Logit of success when X = 0: \(b_0\)
  • \(b_0\) > 0: Logit > 0 when X = 0
  • \(b_0\) < 0: Logit < 0 when X = 0

3.4.5 Logit metric interpretation: General

\[\hat{logit} = ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = 0.251 + 1.219 X\]

General interpretation of slope:
\(b_1\) is the relationship between predictor X and logit of success

  • \(b_1\) > 0: Logit of a success increases as X increases
  • \(b_1\) < 0: Logit of a success decreases as X increases

3.4.6 Logit metric interpretation: Example

\[\hat{logit} = ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = \color{OrangeRed}{0.251} + 1.219 X\]

Interpretation of example intercept

  • \(b_0\) > 0: Logit > 0 when X = 0
  • Logit of success when X = 0: \(\color{OrangeRed}{b_0} = \color{OrangeRed}{0.251}\)

3.4.7 Logit metric interpretation: Example

\[\hat{logit} = ln\left(\frac{\hat{p}}{(1 - \hat{p})}\right) = 0.251 + 1.219 X\]

Interpretation of example slope

  • \(b_1\) > 0: Logit of a success increases by \(\color{OrangeRed}{1.219}\) units when X increases by 1 unit

3.5 Metrics wrap-up

3.5.1 So which metric should I use?

They are equivalent, so use the metric that

  • Makes the most sense to you
  • You can explain fully
  • Is most commonly used in your field

3.5.2 Some things to keep in mind

  • Odds ratios tell you about change, but not where you start

    • If you report odds ratios, also report some measure of probability, e.g., the probability of success at the mean of X
    • Is a 10x change 5% to 50% or 0.05% to 0.5%?
  • Logit is nice because it’s linear, but it’s not very interpretable

    • What is a “logit”? It’s just a mathematical concept that makes a straight line – not actually meaningful
    • But many psychology measures don’t have meaningful metrics…

3.5.3 Confidence intervals

Default results are in logit metric: compare to null value of 0

term          estimate
(Intercept)      0.251
x                1.219

Confidence intervals are in logit metric: do they contain 0?

               2.5 %   97.5 %
(Intercept)   -0.188    0.703
x              0.661    1.876

3.5.4 Confidence intervals

\(e^{estimate}\) converts to odds ratio metric: compare to null value of 1

term          estimate      OR
(Intercept)      0.251   1.285
x                1.219   3.383

\(e^{estimate}\) converts the intervals to the odds ratio metric: do they contain 1?

               2.5 %   97.5 %   OR 2.5 %   OR 97.5 %
(Intercept)   -0.188    0.703      0.829       2.019
x              0.661    1.876      1.938       6.528
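
Tables like these are what coef(), confint(), and exp() produce in R. A sketch using the hypothetical fit from the simulated-data example in section 3.1.1 (so the numbers will differ from the slides):

    coef(fit)                                   # logit-metric estimates
    confint(fit)                                # logit-metric CIs: do they contain 0?
    exp(cbind(OR = coef(fit), confint(fit)))    # odds-ratio metric: do they contain 1?
    # confint() gives profile-likelihood intervals; confint.default() gives Wald intervals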

3.6 A tiny detour

3.6.1 Three alternatives / extensions

  • What if I want to focus more on probability (and don’t care about odds ratios)?
    • Probit regression: based on the cumulative normal distribution, not the logistic distribution
  • What if I have three or more options for my outcome?
    • Categories have an order to them: Ordinal logistic regression
    • Categories have no order to them: Multinomial logistic regression

4 Estimation and model fit

4.1 Estimation

4.1.1 You ran a model: What now?

Usually two things you want to do with it

  • Compute some measure of predictive power or model fit

    • \(R^2_{multiple}\) or similar
  • Compare that model to another competing model

    • Which model is better?

4.1.2 Model estimation

Linear regression is estimated using ordinary least squares (OLS)

  • Produces sums of squares (\(SS\))
  • Measures like \(R^2\) are a function of \(SS\)

GLiMs (like logistic regression) are estimated using maximum likelihood

  • No sums of squares
  • Instead: Deviance, which is a function of the log-likelihood

4.1.3 What is deviance?

  • Conceptually similar to \(SS_{residual}\)

  • If you had \(n\) predictors

    • One predictor per person
    • You could perfectly predict the outcome values
    • This is the “perfect” (saturated) model
  • Deviance is how far from this “perfect” model you are

    • This is “badness” of fit
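
In R, the deviance and log-likelihood of a fitted model are available directly. A sketch, continuing with the simulated dat and fit from section 3.1.1 and adding an intercept-only model (reused in the sketches below):

    # Intercept-only (null) model for later comparisons
    fit0 <- glm(y ~ 1, data = dat, family = binomial)

    deviance(fit)                  # residual deviance of the fitted model
    -2 * as.numeric(logLik(fit))   # same number: for 0/1 outcomes, deviance = -2 * LL
    deviance(fit0)                 # deviance of the intercept-only model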

4.2 \(R^2\) measures

4.2.1 \(R^2\) in linear regression

  • \(R^2\) for linear regression has many desirable qualities

    • Always ranges from 0 to 1
    • Always stays the same or increases with more predictors (never decreases)

Without \(SS_{residual}\), what can we do?

4.2.2 \(R^2\) analogues

  • There are some general measures that work for all GLiMs and some more specific measures that only work for logistic regression

Warning

\(R^2\) analogues don’t have the properties that \(R^2\) in linear regression does

  • Can be less than 0 or greater than 1
  • Can decrease when you add predictors

4.2.3 Pseudo-\(R^2\) or \(R^2_{deviance}\)

\[R^2_{deviance} = 1 - \frac{deviance_{model}}{deviance_{intercept.only.model}}\]

  • Compare your model to a model with no predictor (only intercept)

    • Common for many types of advanced modeling; you could do this for linear regression, but you probably never would
    • Essentially measures how much closer your model is to the “perfect” model than the intercept-only model is
    • Theoretically bounded by 0 and 1, but in practice…

4.2.4 \(R^2_{McFadden}\)

\[R^2_{McFadden} = 1 - \frac{LL_{model}}{LL_{intercept.only.model}}\]

  • Same idea as \(R^2_{deviance}\), just using LL instead of deviance

    • Theoretically bounded by 0 and 1
    • Relatively independent of base rate
      • Base rate is the overall probability of a success in the sample
      • See DeMaris (2002) for more details about logistic regression specific measures
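
Both analogues are easy to compute from the fitted and intercept-only models; a sketch, continuing with fit and fit0 from the sketch in section 4.1.3:

    # Deviance-based pseudo-R^2
    1 - deviance(fit) / deviance(fit0)

    # McFadden's R^2, using log-likelihoods
    1 - as.numeric(logLik(fit)) / as.numeric(logLik(fit0))
    # (for ungrouped 0/1 outcomes these two measures give the same value)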

4.2.5 \(R^2\) as correlation between observed and predicted values

  • In linear regression, \(R^2_{multiple}\) is also the squared correlation between the observed \(Y\) values and the predicted \(Y\) values

  • Most software packages can produce predicted \(Y\) values for your analysis

    • Save predicted values to the dataset
    • Correlate observed and predicted \(Y\) values (squared correlation)
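
A sketch of this squared observed-predicted correlation, again with the hypothetical dat and fit:

    p_hat <- fitted(fit)      # predicted probabilities (same as predict(fit, type = "response"))
    cor(dat$y, p_hat)^2       # squared correlation between observed and predicted values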

4.3 Model comparisons

4.3.1 Model comparisons

  • In linear regression, if you added a predictor, there were two ways to tell if that predictor was adding to the model:

    • Test of the regression coefficient (i.e., Wald test: \(t\)-test or \(z\)-test)
    • \(R^2_{change}\) for added prediction (with its \(F\)-test)
  • For logistic regression, Wald test of the regression coefficient may not be reliable (see Vaeth, 1985)

    • Need to use some analogue of the significance test for \(R^2_{change}\)

4.3.2 Likelihood ratio (LR) test

  • Ratio of likelihoods

    • Specifically, a function of the likelihood from maximum likelihood estimation
    • Even more specifically, \(-2 \times\) the log-likelihood (\(LL\))
    • \(-2 \times LL\) is the deviance
  • Test statistic

    • \(\chi^2 = deviance_{model1} - deviance_{model2}\)
    • How did we get from ratio to difference?
      • A ratio (division) of likelihoods becomes a difference (subtraction) of log-likelihoods

4.3.3 Likelihood ratio (LR) test

\[\chi^2 = deviance_{model1} - deviance_{model2}\]

  • Model 1: simpler model (fewer predictors, worse fit)

  • Model 2: more complex model (more predictors, better fit)

  • Degrees of freedom = difference in number of parameters

    • Significant test: Model 1 is significantly worse than Model 2
    • Non-significant test: Models 1 and 2 are not significantly different, so go with the simpler one (Model 1)

4.3.4 LR test: Example

  • Logistic regression example: Deviance \(= 116.146\)

  • Logistic regression model with no predictors (intercept only): Deviance \(= 137.989\)

  • \(\chi^2(1) = 137.989 - 116.146 = 21.843\)

    • Critical value for \(\chi^2\) with \(1\) df and \(\alpha = 0.05\) is \(3.841\)
    • The test is significant: \(21.843 > 3.841\)
      • Model 2 is better than Model 1
      • The predictor is significant
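
In R, the LR test can be run with anova(), and the numbers in this worked example can be checked against the chi-square distribution directly. A sketch; fit0 and fit are the hypothetical models from the earlier sketches, so their deviances will differ from the slides:

    # Likelihood ratio test: intercept-only model vs. one-predictor model
    anova(fit0, fit, test = "Chisq")

    # Checking the worked example by hand
    qchisq(0.95, df = 1)                                      # critical value: 3.841
    137.989 - 116.146                                         # chi-square statistic: 21.843
    pchisq(137.989 - 116.146, df = 1, lower.tail = FALSE)     # p-value, well below .05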

5 Summary

5.1 Summary

5.1.1 Summary

  • Use logistic regression when your outcome is binary

    • Don’t use linear regression
  • Be careful with interpretation no matter what

    • Probability: Probability makes sense, but it’s nonlinear
    • Odds: Odds ratio seems to make sense but it can be misleading
    • Logit: Linear but what even is a logit?
  • But many basic concepts parallel linear regression

    • Intercept, slope(s), linear combination, \(R^2_{multiple}\)

5.1.2 In class

  • We will

    • Run some logistic regression models
    • Interpret the results