Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Conduct Poisson regression for count outcomes
  • Interpret results in terms of predicted counts
  • Describe when to use alternative models / specifications
    • Negative binomial, zero inflation, offsets, etc.

2 Linear regression review

2.1 Assumptions of linear regression

  • Linear regression (the general linear model, GLM) makes three assumptions about the residuals (\(e_i = Y_i - \hat{Y}_i\)) of the model
    • Independence
    • Constant variance
    • Conditional normality

3 Count variables

3.1 What is a count?

  • A count variable records the number of times an event occurs in a fixed period of time
    • Number of visits to doctor in past 6 months
    • Number of days of poor QOL per month
    • Number of cigarettes smoked in a day
    • Number of gene repeats in a sample
    • Number of adverse events in a study

3.2 Count outcomes in medical research

Code
library(ggplot2)
count1 <- rpois(n = 1000, lambda = 1.5)   # 1000 simulated counts with mean 1.5
count_dat <- as.data.frame(count1)
ggplot(data = count_dat, aes(count1)) +
    geom_bar(fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "Visits to doctor in past 6 months", y = "Frequency") +
    scale_x_continuous(breaks = c(0:12))

3.3 Count variable properties

  • Always
    • Discrete (whole numbers)
    • Lower bound of 0
  • Typically
    • Low mean
    • High frequency of 0 values
    • Right skewed
    • Heteroscedasticity (variance increases with the mean)

3.4 Example data: Alcohol use

  • Simulated data made to resemble real data
  • Outcome: Number of alcoholic drinks consumed last Saturday
    • Ranges from 0 to 17
  • Predictors
    • Sensation seeking: 3 to 7 (theoretically 1 to 7)
    • Sex: 0 = female, 1 = male

3.5 A count variable is not normal

Code
library(readr)
dat <- read_table("jpapoisson.txt", 
    col_names = FALSE)
colnames(dat) <- c("ID", "sensation", "sex", "alcohol")
ggplot(data = dat, aes(x = alcohol)) +
    geom_bar(fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "Number of alcoholic drinks", y = "Frequency")

3.6 Linear regression on count

Code
ggplot(data = dat, aes(x = sensation, y = alcohol)) +
    geom_point(size = 3, 
               alpha = 0.4, 
               color = "#3697aa") +
  geom_smooth(method = "lm", 
              se = FALSE,
              color = "black") + 
  labs(x = "Sensation seeking", y = "Number of alcoholic drinks")

3.7 Residuals for linear regression

Code
library(broom)   # augment() returns residuals and fitted values in a data frame
m1_lin <- lm(data = dat, alcohol ~ sensation)
resids_m1_lin <- augment(m1_lin)
ggplot(data = resids_m1_lin, 
       aes(x = .resid)) + 
  geom_histogram(aes(y = ..density..), 
                 fill = "#3697aa", 
                 color = "#76777a") +
  geom_density() 

3.8 Residuals for linear regression

Code
ggplot(data = resids_m1_lin, 
       aes(sample = .resid)) + 
  geom_qq() + 
  geom_qq_line() +
  labs(x = "Theoretical", 
       y = "Sample")

3.9 Residuals vs predictor

Code
ggplot(data = resids_m1_lin, 
       aes(x = sensation, 
           y = .resid)) + 
  geom_point() + 
  geom_smooth()

3.10 Residuals vs predictor

Code
ggplot(data = resids_m1_lin, 
       aes(x = sensation, 
           y = .resid)) + 
  geom_point() + 
  geom_smooth() +   
  geom_abline(intercept = 0.75, 
                slope = -0.75, 
                color = "#dc1e34", 
                size = 1,
                linetype = "dashed") +
    geom_abline(intercept = 1.5, 
                slope = 1.9, 
                color = "#dc1e34", 
                size = 1,
                linetype = "dashed")

3.11 What NOT to do

  • Ignore the problem
    • Do linear regression anyway
    • Get biased estimates and out-of-range predictions
  • Transform the outcome
    • Square root, natural log, etc.
    • May slightly normalize univariate residual distribution
    • Does not fix heteroscedasticity or (conditional) non-normality

3.12 Transform the outcome - ugh!

Code
library(dplyr)
dat2 <- dat %>%
    mutate(lnalc = log(alcohol + 0.01))   # small constant added so zeroes can be logged
ggplot(data = dat2, aes(x = lnalc)) +
    geom_histogram(fill = "#3697aa", 
                   color = "#76777a", 
                   size = 1) +
    labs(x = "ln(Number of alcoholic drinks)", y = "Frequency")

3.13 Transform the outcome - ugh!

Code
dat2 <- dat2 %>%
    mutate(sqrtalc = sqrt(alcohol))
ggplot(data = dat2, aes(x = sqrtalc)) +
    geom_histogram(fill = "#3697aa", 
                   color = "#76777a", 
                   size = 1) +
    labs(x = "Square root(Number of alcoholic drinks)", y = "Frequency")

3.14 What to do

  • Generalized linear model (GLiM)
    • Not a single model but a family of regression models
    • Choose features (e.g., residual distribution) to match the characteristics of your outcome variable
    • Accommodates many continuous and categorical outcome variables
    • Includes logistic regression and Poisson regression
    • Transforms the predicted outcome (via a link function) rather than the raw outcome

4 Poisson regression

4.1 Poisson regression

  • Outcome: count in a fixed period of time
    • Discrete, lower bound of zero, often right skewed
  • Residual / outcome distribution: Poisson
  • Link function: natural log of the predicted count, \(ln(\hat{Y}) = ln(\hat{\mu})\)

\[ln(\hat{\mu}) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p\]

4.2 Reminder: normal distribution

\[f(x) = {\frac {1}{{\sqrt {2\pi \sigma^2}}}}e^{-{\frac {(x-\mu)^2 }{2 \sigma^2 }}}\]

Mean of normal distribution = \(\mu\)

Variance of normal distribution = \(\sigma^2\)

  • Mean and variance are separate parameters and are unrelated
    • Homoscedasticity: the variance is constant and does not depend on the mean

4.3 Poisson distribution

  • Developed by Siméon Denis Poisson (1781-1840)
  • Number of times a low probability event happens
    • Number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry (Bortkiewicz)
    • Number of yeast cells used when brewing Guinness beer
    • Number of typographical errors on each page of a book
    • Number of bombs falling on each block of London during the Blitz
    • Number of earthquakes per day in Southern California

4.4 Poisson distribution

\[P(X = k) = \frac{\lambda^k}{k!}e^{-\lambda}\]

  • Poisson distribution is
    • Discrete
    • Undefined for values less than 0
    • Right skewed when mean is low
    • \(\lambda\) = mean = variance, so the variance grows with the mean: heteroscedasticity
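
As a quick check on the formula, hand-computed probabilities can be compared against R's built-in dpois() (a small sketch using \(\lambda = 1.5\)):

lambda <- 1.5
k <- 0:4
# Hand-computed Poisson probabilities vs. the built-in density
cbind(manual = lambda^k / factorial(k) * exp(-lambda),
      dpois  = dpois(k, lambda))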

4.5 Poisson: mean = variance = 1

Code
pois_1 <- rpois(10000, 1)
pois_1_data <- data.frame(pois_1)
ggplot(data = pois_1_data, aes(x = pois_1)) + 
    geom_bar(aes(y = (..count..)/sum(..count..)),
             fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "X", y = "Proportion") + 
    xlim(-1, 20) + 
    ylim(0, .400)

4.6 Poisson: mean = variance = 4

Code
pois_4 <- rpois(10000, 4)
pois_4_data <- data.frame(pois_4)
ggplot(data = pois_4_data, aes(x = pois_4)) + 
    geom_bar(aes(y = (..count..)/sum(..count..)),
             fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "X", y = "Proportion") + 
    xlim(-1, 20) + 
    ylim(0, .400)

4.7 Poisson: mean = variance = 7

Code
pois_7 <- rpois(10000, 7)
pois_7_data <- data.frame(pois_7)
ggplot(data = pois_7_data, aes(x = pois_7)) + 
    geom_bar(aes(y = (..count..)/sum(..count..)),
             fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "X", y = "Proportion") + 
    xlim(-1, 20) + 
    ylim(0, .400)

4.8 Poisson: mean = variance = 10

Code
pois_10 <- rpois(10000, 10)
pois_10_data <- data.frame(pois_10)
ggplot(data = pois_10_data, aes(x = pois_10)) + 
    geom_bar(aes(y = (..count..)/sum(..count..)), 
             fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "X", y = "Proportion") + 
    xlim(-1, 20) + 
    ylim(0, .400)

4.9 Poisson regression model

  • glm() function in stats package (always loaded)
    • data: same as lm()
    • formula: same as lm()
    • family: Residual distribution and link function
dat <- dat %>% mutate(sensationC = sensation - mean(sensation, na.rm = TRUE))
m1 <- glm(data = dat, 
          alcohol ~ sensationC, 
          family = poisson(link = "log"))
Code
m1_b0 <- round(coefficients(m1)[1], 3)
m1_b1 <- round(coefficients(m1)[2], 3)

4.10 Poisson regression model

  • Estimate
  • Standard error
  • \(z\) statistic
  • \(p\)-value
  • No \(R^2\)
  • Deviance
summary(m1)

Call:
glm(formula = alcohol ~ sensationC, family = poisson(link = "log"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.05963    0.02967  35.715  < 2e-16 ***
sensationC   0.23148    0.03966   5.837 5.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: 2079

Number of Fisher Scoring iterations: 5

4.11 Two forms of Poisson regression

Predicted count:

\[\hat{\mu} = e^{\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p}}\]

Natural log of predicted count:

\[ln(\hat{\mu}) = \color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p}\]

4.12 Predicted count interpretation

\[\hat{\mu} = e^{b_0 + b_1 X}\]

  • Intercept
    • \(e^{b_0}\) is the predicted count when \(X\) = 0
  • Slope
    • \(e^{b_1}\) is the rate ratio
    • For a 1 unit increase in \(X\), the predicted count is multiplied by \(e^{b_1}\)

4.13 Example: \(\hat{\mu} = e^{1.06 + 0.231 \times sensationC}\)

  • Intercept = \(e^{b_0} = e^{1.06} = 2.886\)
    • People at the mean level of sensation seeking are expected to drink 2.886 alcoholic drinks
  • Slope = \(e^{b_1} = e^{0.231} = 1.26\)
    • Multiplicative effect of \(X\) on the predicted count
    • For a one unit increase in sensation, the number of predicted alcoholic drinks is multiplied by 1.26
    • Also: 26\(\%\) increase in the predicted outcome for each one unit increase in sensation

4.14 Predicted counts: \(e^{b_1} = 1.26\)

Code
sens_vals <- data.frame(sensationC = c(-2, -1, 0, 1))
pred_counts <- data.frame(
    sensationC = sens_vals$sensationC,
    pred_alc = round(predict(m1, newdata = sens_vals, type = "response"), 2)
)
ggplot(dat, 
       aes(x=sensationC, 
           y=alcohol)) +
  geom_point(color = "#3697aa",
             alpha = 0.3,
             size = 2) +
  stat_smooth(method="glm", 
              method.args=list(family="poisson"), 
              se=FALSE,
              color = "black") +
  geom_point(data = pred_counts, 
             aes(x = sensationC, y = pred_alc),
             size = 3,
             color = "black") +
  annotate("text", x = -2, y = 1.82, 
           label = "1.82", vjust = -1, size = 6) +
  annotate("text", x = -1, y = 2.29, 
           label = "2.29", vjust = -1, size = 6) +
  annotate("text", x = 0, y = 2.89, 
           label = "2.89", vjust = -1, size = 6) +
  annotate("text", x = 1, y = 3.64, 
           label = "3.64", vjust = -1, size = 6)

4.15 Confidence intervals: Natural log

  • Original coefficients and confidence intervals
    • Natural log of predicted count metric
coef(m1)
(Intercept)  sensationC 
  1.0596254   0.2314809 
confint(m1)
                2.5 %    97.5 %
(Intercept) 1.0008860 1.1172000
sensationC  0.1541381 0.3096215
  • Do they contain 0?

4.16 Confidence intervals: Predicted count

  • Exponentiate estimates and confidence intervals
    • Predicted count metric
exp(coef(m1))
(Intercept)  sensationC 
   2.885290    1.260465 
exp(confint(m1))
               2.5 %   97.5 %
(Intercept) 2.720691 3.056284
sensationC  1.166652 1.362909
  • Do they contain 1?

5 (Over)Dispersion

5.1 Equidispersion

  • The Poisson distribution displays equidispersion
    • Conditional mean = conditional variance = \(\lambda\)
  • In real data, the actual variance of \(Y\) is typically larger than expected (i.e., larger than the mean)
    • This is called overdispersion

5.2 Why does overdispersion happen?

  • Sometimes, just because
  • Due to omitted predictors
    • Something we should include but did not
    • Omitted predictors affect the mean
      • Heteroscedasticity: they also affect the variance
  • “Excess” zeroes
    • Looks like more variance

5.3 Why do we care?

  • The variance estimated in the model (\(\lambda\)) helps determine the standard errors for regression coefficients
  • It also plays a role in calculating deviance
  • If the variance in our data is larger than \(\lambda\), the standard errors are too small and the deviance is wrong
    • Tests of significance, pseudo-\(R^2\), and model comparisons are wrong

5.4 Is there overdispersion?

  • Quick check: Residual deviance / degrees of freedom > 1 (here 1151.7 / 398 ≈ 2.89)
summary(m1)

Call:
glm(formula = alcohol ~ sensationC, family = poisson(link = "log"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.05963    0.02967  35.715  < 2e-16 ***
sensationC   0.23148    0.03966   5.837 5.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: 2079

Number of Fisher Scoring iterations: 5
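
The same quick check in code, using the fitted m1 (a minimal sketch):

# Residual deviance / residual df; values well above 1 suggest overdispersion
deviance(m1) / df.residual(m1)    # 1151.7 / 398 ≈ 2.89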

5.5 Adding a predictor

  • Can reduce overdispersion
m1add <- glm(data = dat, 
          alcohol ~ sensationC + sex, 
          family = poisson(link = "log"))
summary(m1add)

Call:
glm(formula = alcohol ~ sensationC + sex, family = poisson(link = "log"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.56334    0.05248  10.734  < 2e-16 ***
sensationC   0.26085    0.03882   6.719 1.83e-11 ***
sex          0.83947    0.06292  13.342  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1186.76  on 399  degrees of freedom
Residual deviance:  959.46  on 397  degrees of freedom
AIC: 1888.8

Number of Fisher Scoring iterations: 5

5.6 What do we do?

  • There are two main alternative models for overdispersion
    • Overdispersed Poisson regression a.k.a. quasi-Poisson regression
      • Same coefficients, adjusted standard errors
    • Negative binomial regression
      • Different coefficients, different standard errors
  • Both handle overdispersion, but in slightly different ways

5.7 Overdispersed Poisson regression

  • “Fudge factor”
    • Variance is larger than the model says, so multiply variance by something to make it larger
  • Scale parameter: \(\psi = \frac{\chi^2_{Pearson}}{residual\ df}\) or \(\frac{deviance}{residual\ df}\)
    • If equidispersion holds: \(\psi = 1\)
    • If overdispersion: \(\psi > 1\)
    • Compared to Poisson: Each standard error \(\times \sqrt{\psi}\)
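
A short sketch of computing \(\psi\) by hand from the Poisson fit m1; the Pearson version shown here matches the dispersion parameter that quasi-Poisson reports below:

# Pearson chi-square divided by residual df
psi <- sum(residuals(m1, type = "pearson")^2) / df.residual(m1)
psi                                            # ≈ 2.847
# Poisson SE for sensationC inflated by sqrt(psi)
sqrt(psi) * summary(m1)$coefficients[2, 2]     # ≈ 0.067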

5.8 Overdispersed Poisson regression

m2 <- glm(data = dat, 
          alcohol ~ sensationC, 
          family = quasipoisson)
summary(m2)

Call:
glm(formula = alcohol ~ sensationC, family = quasipoisson, data = dat)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.05963    0.05006  21.166  < 2e-16 ***
sensationC   0.23148    0.06692   3.459 0.000601 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 2.847168)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

5.9 Compare Poisson to OD Poisson

  • Poisson regression
summary(m1)

Call:
glm(formula = alcohol ~ sensationC, family = poisson(link = "log"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.05963    0.02967  35.715  < 2e-16 ***
sensationC   0.23148    0.03966   5.837 5.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: 2079

Number of Fisher Scoring iterations: 5
  • Overdispersed Poisson
summary(m2)

Call:
glm(formula = alcohol ~ sensationC, family = quasipoisson, data = dat)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.05963    0.05006  21.166  < 2e-16 ***
sensationC   0.23148    0.06692   3.459 0.000601 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 2.847168)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

5.10 Negative binomial regression

  • Negative binomial distribution
    • A Poisson-gamma mixture: the Poisson mean itself varies according to a gamma distribution
    • Adds additional variance
  • Different scale parameter: \(\alpha\) (R reports \(\theta = 1/\alpha\))
    • Mean = \(\mu\)
    • Variance = \(\mu + \alpha \mu^2\)
    • If overdispersion: \(\alpha\) > 0
    • If not: \(\alpha\) = 0 and reduces to Poisson

5.11 Negative binomial regression

library(MASS)
m3 <- glm.nb(data = dat, 
             alcohol ~ sensationC)
summary(m3)

Call:
glm.nb(formula = alcohol ~ sensationC, data = dat, init.theta = 1.392859438, 
    link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.06039    0.05170  20.509  < 2e-16 ***
sensationC   0.22050    0.06822   3.232  0.00123 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.3929) family taken to be 1)

    Null deviance: 463.49  on 399  degrees of freedom
Residual deviance: 452.56  on 398  degrees of freedom
AIC: 1772.1

Number of Fisher Scoring iterations: 1

              Theta:  1.393 
          Std. Err.:  0.163 

 2 x log-likelihood:  -1766.091 
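
A small sketch converting the reported \(\theta\) to \(\alpha\) and the variance it implies at the mean of sensation seeking:

alpha <- 1 / m3$theta       # 1 / 1.393 ≈ 0.718
mu <- exp(coef(m3)[1])      # predicted count at mean sensation ≈ 2.89
mu + alpha * mu^2           # negative binomial variance ≈ 8.9, vs. mu for Poisson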

5.12 Compare Poisson to negative binomial

summary(m1)

Call:
glm(formula = alcohol ~ sensationC, family = poisson(link = "log"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.05963    0.02967  35.715  < 2e-16 ***
sensationC   0.23148    0.03966   5.837 5.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: 2079

Number of Fisher Scoring iterations: 5
summary(m3)

Call:
glm.nb(formula = alcohol ~ sensationC, data = dat, init.theta = 1.392859438, 
    link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.06039    0.05170  20.509  < 2e-16 ***
sensationC   0.22050    0.06822   3.232  0.00123 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.3929) family taken to be 1)

    Null deviance: 463.49  on 399  degrees of freedom
Residual deviance: 452.56  on 398  degrees of freedom
AIC: 1772.1

Number of Fisher Scoring iterations: 1

              Theta:  1.393 
          Std. Err.:  0.163 

 2 x log-likelihood:  -1766.091 

5.13 Compare all three models

  • Poisson regression (\(\psi\) = 1, \(\alpha\) = 0)
         term  estimate  std.error statistic       p.value
1 (Intercept) 1.0596254 0.02966915 35.714723 2.335830e-279
2  sensationC 0.2314809 0.03966037  5.836581  5.328294e-09
  • Overdispersed Poisson regression (\(\psi\) = 2.847, \(\alpha\) = 0)
         term  estimate  std.error statistic      p.value
1 (Intercept) 1.0596254 0.05006239 21.166096 3.702312e-67
2  sensationC 0.2314809 0.06692113  3.459011 6.007424e-04
  • Negative binomial regression (\(\theta\) = 1.393, \(\alpha\) = \(1/\theta\) = 0.718)
         term  estimate  std.error statistic      p.value
1 (Intercept) 1.0603897 0.05170433 20.508721 1.799643e-93
2  sensationC 0.2204963 0.06822183  3.232049 1.229061e-03

5.14 Plot: Poisson vs negative binomial

Code
nb_b0 <- summary(m3)$coefficients[1]
nb_b1 <- summary(m3)$coefficients[2]
ggplot(data = dat, aes(x = sensationC, y = alcohol)) +
    geom_point(size = 3, 
               alpha = 0.4, 
               color = "#3697aa") +
    labs(x = "Sensation seeking (centered)", y = "Number of alcoholic drinks") +
    geom_smooth(method = "glm", 
                method.args = list(family = 'poisson'), 
                se = FALSE, 
                color = "black", 
                fullrange = TRUE) +
    geom_function(fun = function(x) exp(nb_b0 + nb_b1 * x),
                color = "#dc1e34", 
                size = 1)

5.15 Observed vs predicted

Code
ggplot(data = dat, aes(x = alcohol)) + 
    geom_bar(fill = "#3697aa", 
             color = "#76777a", 
             size = 1) +
labs(x = "Number of alcoholic drinks", y = "Frequency") +
ylim(0, 90)

5.16 Observed vs predicted

Code
library(pscl)       # predprob() function for predicted counts
                    # zeroinfl() function for zero-inflated count models
                    # hurdle() for hurdle count models
                    # vuong() for non-nested model comparisons
phat.poi <- predprob(m1)                  # predicted probability of each count, per person
phat.poi.mn <- apply(phat.poi, 2, mean)   # average over people
phat.nb <- predprob(m3)
phat.nb.mn <- apply(phat.nb, 2, mean)
x <- seq(0, 17, 1)
probs <- data.frame(x, phat.poi.mn, phat.nb.mn)
ggplot(data = dat, aes(x = alcohol)) + 
    geom_bar(fill = "#3697aa", 
             color = "#76777a", 
             size = 1) +
    geom_line(data = probs, 
              aes(x = x, y = 400*phat.poi.mn), 
              linetype = "solid",
              size = 1,
              color = "black") +
labs(x = "Number of alcoholic drinks", y = "Frequency") +
ylim(0, 90)

5.17 Observed vs predicted

Code
ggplot(data = dat, aes(x = alcohol)) + 
    geom_bar(fill = "#3697aa", 
             color = "#76777a", 
             size = 1) +
    geom_line(data = probs, 
              aes(x = x, y = 400*phat.poi.mn), 
              linetype = "solid",
              size = 1,
              color = "black") +
    geom_line(data = probs, 
              aes(x = x, y = 400*phat.nb.mn), 
              size = 1,
              #linetype = "dashed",
              color = "#dc1e34") +
labs(x = "Number of alcoholic drinks", y = "Frequency") +
ylim(0, 90)

5.18 Which model to pick?

  • Poisson regression
    • You will likely never satisfy the equidispersion assumption
  • Overdispersed Poisson regression
    • Preferred to Poisson regression
    • Appropriate standard errors and relatively simple model
  • Negative binomial regression
    • More complex, but better able to reproduce the observed counts
    • This may be preferred if you’re interested in prediction

5.19 A word of caution…

Berk & MacDonald (2008)

If apparent overdispersion results from specification errors in the systematic part of the Poisson regression model, resorting to the negative binomial distribution does not help. It can make things worse by giving a false sense of security when the fundamental errors in the model remain.

  • In other words, negative binomial regression doesn’t fix a bad model

6 Zeroes

6.1 Zeroes can be really important

  • Conceptually, zeroes are meaningful

  • Lowest possible value of a count

  • Indicate “nothing”

  • Two situations:

    • Too few zeroes (relative to a Poisson distribution)
    • Too many zeroes (relative to a Poisson distribution)

6.2 Too few zeroes

  • Sometimes you have a study where the outcome is a count, but it cannot take on a value of 0
    • Study of medical visits
      • Have to visit the doctor to get involved in the study
    • Study of substance use
      • Only interested in substance users
  • A zero-truncated Poisson regression model uses a truncated Poisson distribution with the probability of zero removed
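
For intuition, the zero-truncated distribution just renormalizes the Poisson probabilities for \(k \geq 1\) by \(1 - P(X = 0)\) (a quick sketch with \(\lambda = 1.5\)):

lambda <- 1.5
k <- 1:5
# Truncated probabilities: drop P(X = 0), then rescale so they sum to 1
dpois(k, lambda) / (1 - dpois(0, lambda))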

6.3 Excess zeroes

  • Counts often display “excess” zeroes
    • More values of 0 than expected for a Poisson distribution
  • Even if the rest of the distribution is approximately Poisson, the “excess” zeroes lead to overdispersion
    • Sometimes, what looks like overdispersion is really excess zeroes
  • There are specific Poisson family models to appropriately deal with excess zeroes, depending on why the zeroes are there

6.4 Why are there all these zeroes?

  • This is a substantive question
    • You need to know about the outcome you’re studying
  • Question: Do the people who are responding zero have some probability of responding otherwise?

6.5 Some probability of responding > 0?

  • Yes \(\rightarrow\) zero-inflated Poisson regression
    • Cigarettes smoked for smoker who hasn’t smoked yet today
    • Alcohol consumed for someone who hasn’t had any yet today
  • No \(\rightarrow\) hurdle regression or with-zeroes regression
    • Cigarettes smoked: nonsmoker always responds 0
    • Alcoholic drinks consumed: abstainer always responds 0
    • (These are called “structural zeroes”)

6.6 Zero-inflated Poisson regression

  • The zeroes have some probability of being non-zero
  • Two parts modeled simultaneously:
    • Logistic regression: Is each zero structural or not?
    • Poisson regression: non-structural zeroes and positive values
  • Can use same set of predictors in both parts, but do not have to
  • Can also model non-structural zeroes and positive values using overdispersed Poisson regression or negative binomial regression
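
A minimal sketch of fitting this model with zeroinfl() from the pscl package (loaded again below); the model name and the choice of sensationC for both parts are illustrative, and the | in the formula separates the count part from the zero part:

library(pscl)
# Count part: Poisson on sensationC; zero part: logistic on sensationC
zip1 <- zeroinfl(alcohol ~ sensationC | sensationC, data = dat, dist = "poisson")
summary(zip1)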

6.7 Hurdle regression (with-zeroes)

  • There are zeroes that have no probability of being non-zero
    • Two different populations: smokers and nonsmokers, drinkers and nondrinkers
  • Two parts modeled simultaneously:
    • Logistic regression: whether zero or not
    • Truncated Poisson regression: positive values only
  • Can use same set of predictors in both parts, but do not have to
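
A parallel sketch with hurdle() from pscl (logistic regression for zero vs. not zero, truncated Poisson for the positive counts); again the predictor choice is illustrative:

library(pscl)
hm1 <- hurdle(alcohol ~ sensationC | sensationC, data = dat, dist = "poisson")
summary(hm1)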

7 Variable lengths of time

7.1 Variable lengths of time

  • Poisson regression and variants assume a fixed length of time
    • Number of aggressive acts committed by a child in 1 hour
    • Number of visits to doctor in past 6 months
    • Number of adverse events in a study
  • Often, we measure a count over some variable period of time

7.2 Variable lengths of time

  • Can extend Poisson type models to variable time periods
    • Include the natural log of the measurement interval as a predictor with its regression coefficient fixed at 1:

\[ln(\hat{\mu}) = ln(time) + b_0 + b_1X_1 + b_2X_2 + \cdots + b_pX_p\]

  • In software: include time (specifically, ln(time)) as an “offset”
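
A hedged sketch of the offset in glm(); the months variable here is hypothetical, standing in for each person's measurement interval (the example data have no such variable):

# Hypothetical 'months' column recording each person's observation window;
# offset(log(months)) enters the model with its coefficient fixed at 1
m_off <- glm(alcohol ~ sensationC + offset(log(months)),
             data = dat, family = poisson(link = "log"))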

8 Evaluation and comparison

8.1 Deviance

  • As with logistic regression
    • Estimated with maximum likelihood
    • Use deviance to estimate pseudo-\(R^2\)
  • Deviance is a function of the log-likelihood
    • How far the model is from a theoretically perfect (saturated) model
    • Relative measure, so it can only be interpreted by comparison to another model
  • Deviance usually = -2LL, but not for Poisson models

8.2 Pseudo \(R^2\)

\[R^2_{deviance} = 1 - \frac{deviance_{model}}{deviance_{intercept\ only}}\]

  • Deviance for your model is “Residual deviance” from output
  • Deviance for intercept model is “Null deviance” from output
  • Theoretically ranges from 0 to 1 (but not always in practice)
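
Computing it for m1 is a one-liner using the deviances stored in the model object:

# 1 - residual deviance / null deviance
1 - m1$deviance / m1$null.deviance    # 1 - 1151.7/1186.8 ≈ 0.03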

8.3 Compare models

  • Compare deviances of 2 nested models using likelihood ratio test
  • e.g., compare m1 (only sensationC) to m1add (sensationC and sex)
anova(m1, m1add, test = "LRT")
Analysis of Deviance Table

Model 1: alcohol ~ sensationC
Model 2: alcohol ~ sensationC + sex
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       398    1151.69                          
2       397     959.46  1   192.23 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.4 Nested models

  • Two models are nested if
    • All terms in one are included in the other
    • You can get from one to the other by fixing some parameters to a value (e.g., a coefficient to 0)
  • Why do we care?
    • Likelihood ratio tests can only be used to compare nested models
      • If your models are not nested, you cannot use the likelihood ratio tests to compare them
    • Instead, use AIC, BIC (smaller is better), Vuong test

8.5 Nested models

  1. Linear regression with predictor \(X_1\)
  2. Linear regression with predictors \(X_1\) and \(X_2\)
  • Model 1 is nested within model 2
    • If the coefficient for \(X_2\) = 0 then Model 2 becomes Model 1

8.6 Complicated nesting situations 1

  3. Poisson regression with predictors \(X_3\) and \(X_4\)
  4. Overdispersed Poisson regression w/ predictors \(X_3\) and \(X_4\)
  • Model 3 seems like it’s nested within model 4
    • But it’s not really
  • Both estimate the same number of parameters, so the comparison has df = 0

8.7 Complicated nesting situations 2

  5. Overdispersed Poisson regression w/ predictor \(X_5\)
  6. Overdispersed Poisson regression w/ predictors \(X_5\) and \(X_6\)
  • Model 5 is NOT nested within model 6

  • In addition to the extra predictor (\(X_6\)), the overdispersion parameter is not the same in these models

    • Model 5 has a \(\psi\) parameter based on only \(X_5\)
    • Model 6 has a different \(\psi\) parameter based on both \(X_5\) and \(X_6\)
    • Can’t get from Model 6 to Model 5 by setting coefficient to 0

8.8 Complicated nesting situations 3

  7. Overdispersed Poisson regression w/ predictors \(X_7\) and \(X_8\)
  8. Negative binomial regression w/ predictors \(X_7\) and \(X_8\)
  • Model 7 is NOT nested within model 8 (or vice versa)

  • The overdispersion parameters are different for the two models

    • Model 7 has a \(\psi\) parameter and estimates variance = \(\psi \mu\)
    • Model 8 has an \(\alpha\) parameter and estimates variance = \(\mu + \alpha \mu^2\)

8.9 Vuong test

  • For nested or non-nested models
  • vuong() function in pscl package
library(pscl)
vuong(m1, m3)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                   -6.911618 model2 > model1 2.3958e-12
AIC-corrected         -6.911618 model2 > model1 2.3958e-12
BIC-corrected         -6.911618 model2 > model1 2.3958e-12

9 Summary

9.1 Summary

  • When your outcome is a count, use a model from the Poisson regression family
    • Nonlinear, exponential model or log-linear model
    • Overdispersion: overdispersed Poisson or negative binomial
    • Too many or too few zeroes
    • Variable lengths of time

9.2 In class

  • Run some Poisson regression models
    • Poisson
    • Overdispersed Poisson
    • Negative binomial
  • Interpret the results

9.3 Next time

  • Logistic regression for binary outcomes
  • Poisson regression for count outcomes
  • Survival analysis for survival outcomes