BTS 510 Lab 11

set.seed(12345)
library(tidyverse)

Warning: package 'purrr' was built under R version 4.5.1

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(Stat2Data)
theme_set(theme_classic(base_size = 16))

1 Learning objectives

Conduct logistic regression for binary outcomes
Interpret results in terms of probability, odds, and log-odds

2 Data

ICU data from the Stat2Data package: n = 200
- ID: Patient ID code
- Survive: 1 = patient survived to discharge or 0 = patient died
- Age: Age (in years)
- AgeGroup: 1 = young (under 50), 2 = middle (50-69), 3 = old (70+)
- Sex: 1 = female or 0 = male
- Infection: 1 = infection suspected or 0 = no infection
- SysBP: Systolic blood pressure (in mm of Hg)
- Pulse: Heart rate (beats per minute)
- Emergency: 1 = emergency admission or 0 = elective admission

3 Analysis

Replicate and extend the analysis from the lecture
- Age and Emergency predict Survive
- Be sure to mean center Age for interpretability

4 Pseudo R^2

For linear regression, R^2_{multiple} is calculated from the sums of squares from ordinary least squares estimation
Logistic regression is estimated via maximum likelihood
- There are no sums of squares
- There is, however, deviance
Deviance is a function of the log-likelihood
- Deviance for a model is how far the model is from a theoretical perfect model where every observation is perfectly predicted
- Relative measure so can only compare to another model
- Big deviance (far from perfect) is worse than small deviance (closer to perfect)
Pseudo R^2
- Compare your model to a model with no predictors (only intercept)
- How much closer is this model to the perfect model than the intercept only model?
- Theoretically ranges from 0 to 1 (but not in practice)
- R^2_{deviance} = 1 - \frac{deviance_{M}}{deviance_{intercept\ only}}
- Deviance for your model is “Residual deviance” from output
- Deviance for intercept model is “Null deviance” from output

5 Confidence intervals

Calculate confidence intervals for all coefficients in a model using confint()
- For example: confint(m1) to get confidence intervals for model m1
- If the CI doesn’t include zero, the coefficient is significant
- Default is 99% confidence interval
  - Use level argument to change
  - e.g., confint(m1, level = 0.99) produces 99% CIs
To convert from default logit metric to odds metric: Exponentiate
- Logit metric: b_1
- Convert to odds ratio: e^{b_1}
- Do the same thing to the confidence intervals
  - exp(confint(m1)) produces the odds metric CIs
  - Compare to “no effect” value of one for odds
  - If the CI doesn’t include one, the coefficient is significant

6 Predicted values

Two main options
- Use R like a calculator to calculate predicted values using equations
  - You can be fancy and grab the coefficients directly from the output
  - e.g., using coef(m1)[1] to get the first coefficient, etc.
  - Or just type the numbers in
- Use predict() to calculate predicted values
  - Set up a new dataset with the predictor values you’re interested in
  - e.g., age_vals <- data.frame(AgeC = c(-20, 0, 20))
  - Then use that new data in predict()
  - e.g., predict(m1, newdata = age_vals, type = "response")
  - type = "response" gives predicted values in probability
  - type = "link" gives predicted values in logit

7 Tasks

Model 1: Survive ~ Age
Model 2: Survive ~ Age + Emergency

Conduct a likelihood ratio test to compare the models. Report the results. Which model is preferred?
Report the full results of the preferred model. This is similar to linear regression. For each coefficient (including intercept), report:

Regression coefficient
Standard error
z statistic (note that there aren’t degrees of freedom for z tests)
p value

Calculate predicted probabilities for 6 predictor combinations:

Emergency admit, 20 years younger than mean
Emergency admit, mean age
Emergency admit, 20 years older than mean
Non-emergency admit, 20 years younger than mean
Non-emergency admit, mean age
Non-emergency admit, 20 years older than mean

Report the odds ratios for each predictor and their confidence intervals. Interpret the odds ratios in words.
Describe the overall findings for this model. Be statistically accurate but avoid jargon and technical terms as much as you can. Be sure to use the names of the variables studied (i.e., survival, age, emergency admission) rather than X and Y.

--- title: "BTS 510 Lab 11" format: html: embed-resources: true self-contained-math: true html-math-method: katex number-sections: true toc: true code-tools: true code-block-bg: true code-block-border-left: "#31BAE9" --- ```{r} #| label: setup set.seed(12345) library(tidyverse) library(Stat2Data) theme_set(theme_classic(base_size = 16)) ``` ## Learning objectives * **Conduct** logistic regression for *binary outcomes* * **Interpret** results in terms of *probability*, *odds*, and *log-odds* ## Data * `ICU` data from the **Stat2Data** package: $n$ = 200 * `ID`: Patient ID code * `Survive`: 1 = patient survived to discharge or 0 = patient died * `Age`: Age (in years) * `AgeGroup`: 1 = young (under 50), 2 = middle (50-69), 3 = old (70+) * `Sex`: 1 = female or 0 = male * `Infection`: 1 = infection suspected or 0 = no infection * `SysBP`: Systolic blood pressure (in mm of Hg) * `Pulse`: Heart rate (beats per minute) * `Emergency`: 1 = emergency admission or 0 = elective admission ## Analysis * Replicate and extend the analysis from the lecture * `Age` and `Emergency` predict `Survive` * Be sure to mean center `Age` for interpretability ## Pseudo $R^2$ * For linear regression, $R^2_{multiple}$ is calculated from the **sums of squares** from ordinary least squares estimation * Logistic regression is estimated via maximum likelihood * There are no sums of squares * There is, however, **deviance** * Deviance is a function of the log-likelihood * Deviance for a model is how far the model is from a *theoretical* **perfect model** where every observation is perfectly predicted * **Relative measure** so can only **compare** to another model * Big deviance (far from perfect) is worse than small deviance (closer to perfect) * Pseudo $R^2$ * Compare your model to a model with **no predictors** (only intercept) * How much closer is this model to the perfect model than the intercept only model? * Theoretically ranges from 0 to 1 (but not in practice) * $R^2_{deviance} = 1 - \frac{deviance_{M}}{deviance_{intercept\ only}}$ * Deviance for your model is "Residual deviance" from output * Deviance for intercept model is "Null deviance" from output ## Confidence intervals * Calculate confidence intervals for all coefficients in a model using `confint()` * For example: `confint(m1)` to get confidence intervals for model `m1` * If the CI doesn't include **zero**, the coefficient is significant * Default is 99% confidence interval * Use `level` argument to change * e.g., `confint(m1, level = 0.99)` produces 99% CIs * To convert from default logit metric to odds metric: Exponentiate * Logit metric: $b_1$ * Convert to **odds ratio**: $e^{b_1}$ * Do the same thing to the confidence intervals * `exp(confint(m1))` produces the odds metric CIs * Compare to "no effect" value of **one** for odds * If the CI doesn't include **one**, the coefficient is significant ## Predicted values * Two main options * Use R like a calculator to calculate predicted values using equations * You can be fancy and grab the coefficients directly from the output * e.g., using `coef(m1)[1]` to get the first coefficient, etc. * Or just type the numbers in * Use `predict()` to calculate predicted values * Set up a new dataset with the predictor values you're interested in * e.g., `age_vals <- data.frame(AgeC = c(-20, 0, 20))` * Then use that new data in `predict()` * e.g., `predict(m1, newdata = age_vals, type = "response")` * `type = "response"` gives predicted values in **probability** * `type = "link"` gives predicted values in **logit** ## Tasks * Model 1: `Survive ~ Age` * Model 2: `Survive ~ Age + Emergency` 1. Conduct a likelihood ratio test to compare the models. Report the results. Which model is preferred? 2. Report the full results of the preferred model. This is similar to linear regression. For each coefficient (including intercept), report: * Regression coefficient * Standard error * $z$ statistic (note that there aren't degrees of freedom for $z$ tests) * $p$ value 3. Calculate predicted probabilities for 6 predictor combinations: * Emergency admit, 20 years younger than mean * Emergency admit, mean age * Emergency admit, 20 years older than mean * Non-emergency admit, 20 years younger than mean * Non-emergency admit, mean age * Non-emergency admit, 20 years older than mean 4. Report the odds ratios for each predictor and their confidence intervals. Interpret the odds ratios in words. 5. Describe the overall findings for this model. Be statistically accurate but avoid jargon and technical terms as much as you can. Be sure to use the names of the variables studied (i.e., survival, age, emergency admission) rather than X and Y.