Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Conduct Poisson regression for count outcomes
  • Interpret results in terms of predicted counts
  • Describe when to use alternative models / specifications
    • Negative binomial, zero inflation, offsets, etc.

2 Linear regression review

2.1 Assumptions of linear regression

  • Linear regression (the general linear model, GLM) makes three assumptions about the residuals (\(e_i = Y_i - \hat{Y}_i\)) of the model
    • Independence
    • Constant variance
    • Conditional normality

3 Count variables

3.1 What is a count?

  • A count variable records the number of times an event occurs in a fixed period of time
    • Number of visits to doctor in past 6 months
    • Number of days of poor QOL per month
    • Number of cigarettes smoked in a day
    • Number of gene repeats in a sample
    • Number of adverse events in a study

3.2 Count outcomes in medical research

Code
library(ggplot2)
count1 <- rpois(n = 1000, lambda = 1.5)   # 1000 simulated counts with mean 1.5
count_dat <- as.data.frame(count1)
ggplot(data = count_dat, aes(count1)) +
    geom_bar(fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "Visits to doctor in past 6 months", y = "Frequency") +
    scale_x_continuous(breaks = c(0:12))

3.3 Count variable properties

  • Always
    • Discrete (whole numbers)
    • Lower bound of 0
  • Typically
    • Low mean
    • High frequency of 0 values
    • Right skewed
    • Heteroscedasticity (variance increases with the mean)

3.4 Example data: Alcohol use

  • Simulated data made to resemble real data
  • Outcome: Number of alcoholic drinks consumed last Saturday
    • Ranges from 0 to 17
  • Predictors
    • Sensation seeking: 3 to 7 (theoretically 1 to 7)
    • Sex: 0 = female, 1 = male

3.5 A count variable is not normal

Code
library(readr)
dat <- read_table("jpapoisson.txt", 
    col_names = FALSE)
colnames(dat) <- c("ID", "sensation", "sex", "alcohol")
ggplot(data = dat, aes(x = alcohol)) +
    geom_bar(fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "Number of alcoholic drinks", y = "Frequency")

3.6 Linear regression on count

Code
ggplot(data = dat, aes(x = sensation, y = alcohol)) +
    geom_point(size = 3, 
               alpha = 0.4, 
               color = "#3697aa") +
  geom_smooth(method = "lm", 
              se = FALSE,
              color = "black") + 
  labs(x = "Sensation seeking", y = "Number of alcoholic drinks")

3.7 Residuals for linear regression

Code
library(broom)   # augment() returns residuals and fitted values in a data frame
m1_lin <- lm(data = dat, alcohol ~ sensation)
resids_m1_lin <- augment(m1_lin)
ggplot(data = resids_m1_lin, 
       aes(x = .resid)) + 
  geom_histogram(aes(y = ..density..), 
                 fill = "#3697aa", 
                 color = "#76777a") +
  geom_density() 

3.8 Residuals for linear regression

Code
ggplot(data = resids_m1_lin, 
       aes(sample = .resid)) + 
  geom_qq() + 
  geom_qq_line() +
  labs(x = "Theoretical", 
       y = "Sample")

3.9 Residuals vs predictor

Code
ggplot(data = resids_m1_lin, 
       aes(x = sensation, 
           y = .resid)) + 
  geom_point() + 
  geom_smooth()

3.10 Residuals vs predictor

Code
ggplot(data = resids_m1_lin, 
       aes(x = sensation, 
           y = .resid)) + 
  geom_point() + 
  geom_smooth() +   
  geom_abline(intercept = 0.75, 
                slope = -0.75, 
                color = "#dc1e34", 
                size = 1,
                linetype = "dashed") +
    geom_abline(intercept = 1.5, 
                slope = 1.9, 
                color = "#dc1e34", 
                size = 1,
                linetype = "dashed")

3.11 What NOT to do

  • Ignore the problem
    • Do linear regression anyway
    • Get biased estimates and out-of-range predictions
  • Transform the outcome
    • Square root, natural log, etc.
    • May slightly normalize univariate residual distribution
    • Does not fix heteroscedasticity or (conditional) non-normality

3.12 Transform the outcome - ugh!

Code
library(dplyr)
dat2 <- dat %>%
    mutate(lnalc = log(alcohol + 0.01))   # small constant added so zeroes can be logged
ggplot(data = dat2, aes(x = lnalc)) +
    geom_histogram(fill = "#3697aa", 
                   color = "#76777a", 
                   size = 1) +
    labs(x = "ln(Number of alcoholic drinks)", y = "Frequency")

3.13 Transform the outcome - ugh!

Code
dat2 <- dat2 %>%
    mutate(sqrtalc = sqrt(alcohol))
ggplot(data = dat2, aes(x = sqrtalc)) +
    geom_histogram(fill = "#3697aa", 
                   color = "#76777a", 
                   size = 1) +
    labs(x = "Square root(Number of alcoholic drinks)", y = "Frequency")

3.14 What to do

  • Generalized linear model (GLiM)
    • Not a single model but a family of regression models
    • Choose features (e.g., residual distribution) to match the characteristics of your outcome variable
    • Accommodates many continuous and categorical outcome variables
    • Includes logistic regression and Poisson regression
    • Transforms the predicted outcome (via a link function) rather than the raw outcome

4 Poisson regression

4.1 Poisson regression

  • Outcome: count in a fixed period of time
    • Discrete, lower bound of zero, often right skewed
  • Residual / outcome distribution: Poisson
  • Link function: natural log of the predicted count, \(ln(\hat{Y}) = ln(\hat{\mu})\)

\[ln(\hat{\mu}) = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p\]

4.2 Reminder: normal distribution

\[f(x) = {\frac {1}{{\sqrt {2\pi \sigma^2}}}}e^{-{\frac {(x-\mu)^2 }{2 \sigma^2 }}}\]

Mean of normal distribution = \(\mu\)

Variance of normal distribution = \(\sigma^2\)

  • Mean and variance are separate parameters and are unrelated
    • Homoscedasticity: the variance is constant and does not depend on the mean

4.3 Poisson distribution

  • Developed by Siméon Denis Poisson (1781-1840)
  • Number of times a low probability event happens
    • Number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry (Bortkiewicz)
    • Number of yeast cells used when brewing Guinness beer
    • Number of typographical errors on each page of a book
    • Number of bombs falling on each block of London during the Blitz
    • Number of earthquakes per day in Southern California

4.4 Poisson distribution

\[P(X = k) = \frac{\lambda^k}{k!}e^{-\lambda}\]

  • Poisson distribution is
    • Discrete
    • Undefined for values less than 0
    • Right skewed when mean is low
    • \(\lambda\) = mean = variance, so the variance grows with the mean: heteroscedasticity
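
As a quick check on the formula, hand-computed probabilities can be compared against R's built-in dpois() (a small sketch using \(\lambda = 1.5\)):

lambda <- 1.5
k <- 0:4
# Hand-computed Poisson probabilities vs. the built-in density
cbind(manual = lambda^k / factorial(k) * exp(-lambda),
      dpois  = dpois(k, lambda))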

4.5 Poisson: mean = variance = 1

Code
pois_1 <- rpois(10000, 1)
pois_1_data <- data.frame(pois_1)
ggplot(data = pois_1_data, aes(x = pois_1)) + 
    geom_bar(aes(y = (..count..)/sum(..count..)),
             fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "X", y = "Proportion") + 
    xlim(-1, 20) + 
    ylim(0, .400)

4.6 Poisson: mean = variance = 4

Code
pois_4 <- rpois(10000, 4)
pois_4_data <- data.frame(pois_4)
ggplot(data = pois_4_data, aes(x = pois_4)) + 
    geom_bar(aes(y = (..count..)/sum(..count..)),
             fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "X", y = "Proportion") + 
    xlim(-1, 20) + 
    ylim(0, .400)

4.7 Poisson: mean = variance = 7

Code
pois_7 <- rpois(10000, 7)
pois_7_data <- data.frame(pois_7)
ggplot(data = pois_7_data, aes(x = pois_7)) + 
    geom_bar(aes(y = (..count..)/sum(..count..)),
             fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "X", y = "Proportion") + 
    xlim(-1, 20) + 
    ylim(0, .400)

4.8 Poisson: mean = variance = 10

Code
pois_10 <- rpois(10000, 10)
pois_10_data <- data.frame(pois_10)
ggplot(data = pois_10_data, aes(x = pois_10)) + 
    geom_bar(aes(y = (..count..)/sum(..count..)), 
             fill = "#3697aa", color = "#76777a", size = 1) +
    labs(x = "X", y = "Proportion") + 
    xlim(-1, 20) + 
    ylim(0, .400)

4.9 Poisson regression model

  • glm() function in stats package (always loaded)
    • data: same as lm()
    • formula: same as lm()
    • family: Residual distribution and link function
dat <- dat %>% mutate(sensationC = sensation - mean(sensation, na.rm = TRUE))
m1 <- glm(data = dat, 
          alcohol ~ sensationC, 
          family = poisson(link = "log"))
Code
m1_b0 <- round(coefficients(m1)[1], 3)
m1_b1 <- round(coefficients(m1)[2], 3)

4.10 Poisson regression model

  • Estimate
  • Standard error
  • \(z\) statistic
  • \(p\)-value
  • No \(R^2\)
  • Deviance
summary(m1)

Call:
glm(formula = alcohol ~ sensationC, family = poisson(link = "log"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.05963    0.02967  35.715  < 2e-16 ***
sensationC   0.23148    0.03966   5.837 5.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: 2079

Number of Fisher Scoring iterations: 5

4.11 Two forms of Poisson regression

Predicted count:

\[\hat{\mu} = e^{\color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p}}\]

Natural log of predicted count:

\[ln(\hat{\mu}) = \color{OrangeRed}{b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p}\]

4.12 Predicted count interpretation

\[\hat{\mu} = e^{b_0 + b_1 X}\]

  • Intercept
    • \(e^{b_0}\) is the predicted count when \(X\) = 0
  • Slope
    • \(e^{b_1}\) is the rate ratio
    • For a 1 unit increase in \(X\), the predicted count is multiplied by \(e^{b_1}\)

4.13 Example: \(\hat{\mu} = e^{1.06 + 0.231 \times sensationC}\)

  • Intercept = \(e^{b_0} = e^{1.06} = 2.886\)
    • People at the mean level of sensation seeking are expected to drink 2.886 alcoholic drinks
  • Slope = \(e^{b_1} = e^{0.231} = 1.26\)
    • Multiplicative effect of \(X\) on the predicted count
    • For a one unit increase in sensation, the number of predicted alcoholic drinks is multiplied by 1.26
    • Also: 26\(\%\) increase in the predicted outcome for each one unit increase in sensation

4.14 Predicted counts: \(e^{b_1} = 1.26\)

Code
sens_vals <- data.frame(sensationC = c(-2, -1, 0, 1))
pred_counts <- data.frame(
    sensationC = sens_vals$sensationC,
    pred_alc = round(predict(m1, newdata = sens_vals, type = "response"), 2)
)
ggplot(dat, 
       aes(x=sensationC, 
           y=alcohol)) +
  geom_point(color = "#3697aa",
             alpha = 0.3,
             size = 2) +
  stat_smooth(method="glm", 
              method.args=list(family="poisson"), 
              se=FALSE,
              color = "black") +
  geom_point(data = pred_counts, 
             aes(x = sensationC, y = pred_alc),
             size = 3,
             color = "black") +
  annotate("text", x = -2, y = 1.82, 
           label = "1.82", vjust = -1, size = 6) +
  annotate("text", x = -1, y = 2.29, 
           label = "2.29", vjust = -1, size = 6) +
  annotate("text", x = 0, y = 2.89, 
           label = "2.89", vjust = -1, size = 6) +
  annotate("text", x = 1, y = 3.64, 
           label = "3.64", vjust = -1, size = 6)

4.15 Confidence intervals: Natural log

  • Original coefficients and confidence intervals
    • Natural log of predicted count metric
coef(m1)
(Intercept)  sensationC 
  1.0596254   0.2314809 
confint(m1)
                2.5 %    97.5 %
(Intercept) 1.0008860 1.1172000
sensationC  0.1541381 0.3096215
  • Do they contain 0?

4.16 Confidence intervals: Predicted count

  • Exponentiate estimates and confidence intervals
    • Predicted count metric
exp(coef(m1))
(Intercept)  sensationC 
   2.885290    1.260465 
exp(confint(m1))
               2.5 %   97.5 %
(Intercept) 2.720691 3.056284
sensationC  1.166652 1.362909
  • Do they contain 1?

5 (Over)Dispersion

5.1 Equidispersion

  • The Poisson distribution displays equidispersion
    • Conditional mean = conditional variance = \(\lambda\)
  • In real data, the actual variance of \(Y\) is typically larger than expected (i.e., larger than the mean)
    • This is called overdispersion

5.2 Why does overdispersion happen?

  • Sometimes, just because
  • Due to omitted predictors
    • Something we should include but did not
    • Omitted predictors affect the mean
      • Heteroscedasticity: they also affect the variance
  • “Excess” zeroes
    • Looks like more variance

5.3 Why do we care?

  • The variance estimated in the model (\(\lambda\)) helps determine the standard errors for regression coefficients
  • It also plays a role in calculating deviance
  • If the variance in our data is larger than \(\lambda\), the standard errors are too small and the deviance is wrong
    • Tests of significance, pseudo-\(R^2\), and model comparisons are wrong

5.4 Is there overdispersion?

  • Quick check: Residual deviance / degrees of freedom > 1 (here 1151.7 / 398 ≈ 2.89)
summary(m1)

Call:
glm(formula = alcohol ~ sensationC, family = poisson(link = "log"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.05963    0.02967  35.715  < 2e-16 ***
sensationC   0.23148    0.03966   5.837 5.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: 2079

Number of Fisher Scoring iterations: 5
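
The same quick check in code, using the fitted m1 (a minimal sketch):

# Residual deviance / residual df; values well above 1 suggest overdispersion
deviance(m1) / df.residual(m1)    # 1151.7 / 398 ≈ 2.89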

5.5 Adding a predictor

  • Can reduce overdispersion
m1add <- glm(data = dat, 
          alcohol ~ sensationC + sex, 
          family = poisson(link = "log"))
summary(m1add)

Call:
glm(formula = alcohol ~ sensationC + sex, family = poisson(link = "log"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.56334    0.05248  10.734  < 2e-16 ***
sensationC   0.26085    0.03882   6.719 1.83e-11 ***
sex          0.83947    0.06292  13.342  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1186.76  on 399  degrees of freedom
Residual deviance:  959.46  on 397  degrees of freedom
AIC: 1888.8

Number of Fisher Scoring iterations: 5

5.6 What do we do?

  • There are two main alternative models for overdispersion
    • Overdispersed Poisson regression a.k.a. quasi-Poisson regression
      • Same coefficients, adjusted standard errors
    • Negative binomial regression
      • Different coefficients, different standard errors
  • Both handle overdispersion, but in slightly different ways

5.7 Overdispersed Poisson regression

  • “Fudge factor”
    • Variance is larger than the model says, so multiply variance by something to make it larger
  • Scale parameter: \(\psi = \frac{\chi^2_{Pearson}}{residual\ df}\) or \(\frac{deviance}{residual\ df}\)
    • If equidispersion holds: \(\psi = 1\)
    • If overdispersion: \(\psi > 1\)
    • Compared to Poisson: Each standard error \(\times \sqrt{\psi}\)
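
A short sketch of computing \(\psi\) by hand from the Poisson fit m1; the Pearson version shown here matches the dispersion parameter that quasi-Poisson reports below:

# Pearson chi-square divided by residual df
psi <- sum(residuals(m1, type = "pearson")^2) / df.residual(m1)
psi                                            # ≈ 2.847
# Poisson SE for sensationC inflated by sqrt(psi)
sqrt(psi) * summary(m1)$coefficients[2, 2]     # ≈ 0.067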

5.8 Overdispersed Poisson regression

m2 <- glm(data = dat, 
          alcohol ~ sensationC, 
          family = quasipoisson)
summary(m2)

Call:
glm(formula = alcohol ~ sensationC, family = quasipoisson, data = dat)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.05963    0.05006  21.166  < 2e-16 ***
sensationC   0.23148    0.06692   3.459 0.000601 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 2.847168)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

5.9 Compare Poisson to OD Poisson

  • Poisson regression
summary(m1)

Call:
glm(formula = alcohol ~ sensationC, family = poisson(link = "log"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.05963    0.02967  35.715  < 2e-16 ***
sensationC   0.23148    0.03966   5.837 5.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: 2079

Number of Fisher Scoring iterations: 5
  • Overdispersed Poisson
summary(m2)

Call:
glm(formula = alcohol ~ sensationC, family = quasipoisson, data = dat)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.05963    0.05006  21.166  < 2e-16 ***
sensationC   0.23148    0.06692   3.459 0.000601 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 2.847168)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

5.10 Negative binomial regression

  • Negative binomial distribution
    • A Poisson-gamma mixture: the Poisson mean itself varies according to a gamma distribution
    • Adds additional variance
  • Different scale parameter: \(\alpha\) (R reports \(\theta = 1/\alpha\))
    • Mean = \(\mu\)
    • Variance = \(\mu + \alpha \mu^2\)
    • If overdispersion: \(\alpha\) > 0
    • If not: \(\alpha\) = 0 and reduces to Poisson

5.11 Negative binomial regression

library(MASS)
m3 <- glm.nb(data = dat, 
             alcohol ~ sensationC)
summary(m3)

Call:
glm.nb(formula = alcohol ~ sensationC, data = dat, init.theta = 1.392859438, 
    link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.06039    0.05170  20.509  < 2e-16 ***
sensationC   0.22050    0.06822   3.232  0.00123 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.3929) family taken to be 1)

    Null deviance: 463.49  on 399  degrees of freedom
Residual deviance: 452.56  on 398  degrees of freedom
AIC: 1772.1

Number of Fisher Scoring iterations: 1

              Theta:  1.393 
          Std. Err.:  0.163 

 2 x log-likelihood:  -1766.091 
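
A small sketch converting the reported \(\theta\) to \(\alpha\) and the variance it implies at the mean of sensation seeking:

alpha <- 1 / m3$theta       # 1 / 1.393 ≈ 0.718
mu <- exp(coef(m3)[1])      # predicted count at mean sensation ≈ 2.89
mu + alpha * mu^2           # negative binomial variance ≈ 8.9, vs. mu for Poisson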

5.12 Compare Poisson to negative binomial

summary(m1)

Call:
glm(formula = alcohol ~ sensationC, family = poisson(link = "log"), 
    data = dat)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.05963    0.02967  35.715  < 2e-16 ***
sensationC   0.23148    0.03966   5.837 5.33e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1186.8  on 399  degrees of freedom
Residual deviance: 1151.7  on 398  degrees of freedom
AIC: 2079

Number of Fisher Scoring iterations: 5
summary(m3)

Call:
glm.nb(formula = alcohol ~ sensationC, data = dat, init.theta = 1.392859438, 
    link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.06039    0.05170  20.509  < 2e-16 ***
sensationC   0.22050    0.06822   3.232  0.00123 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1.3929) family taken to be 1)

    Null deviance: 463.49  on 399  degrees of freedom
Residual deviance: 452.56  on 398  degrees of freedom
AIC: 1772.1

Number of Fisher Scoring iterations: 1

              Theta:  1.393 
          Std. Err.:  0.163 

 2 x log-likelihood:  -1766.091 

5.13 Compare all three models

  • Poisson regression (\(\psi\) = 1, \(\alpha\) = 0)
         term  estimate  std.error statistic       p.value
1 (Intercept) 1.0596254 0.02966915 35.714723 2.335830e-279
2  sensationC 0.2314809 0.03966037  5.836581  5.328294e-09
  • Overdispersed Poisson regression (\(\psi\) = 2.847, \(\alpha\) = 0)
         term  estimate  std.error statistic      p.value
1 (Intercept) 1.0596254 0.05006239 21.166096 3.702312e-67
2  sensationC 0.2314809 0.06692113  3.459011 6.007424e-04
  • Negative binomial regression (\(\theta\) = 1.393, \(\alpha\) = \(1/\theta\) = 0.718)
         term  estimate  std.error statistic      p.value
1 (Intercept) 1.0603897 0.05170433 20.508721 1.799643e-93
2  sensationC 0.2204963 0.06822183  3.232049 1.229061e-03

5.14 Plot: Poisson vs negative binomial

Code
nb_b0 <- summary(m3)$coefficients[1]
nb_b1 <- summary(m3)$coefficients[2]
ggplot(data = dat, aes(x = sensationC, y = alcohol)) +
    geom_point(size = 3, 
               alpha = 0.4, 
               color = "#3697aa") +
    labs(x = "Sensation seeking (centered)", y = "Number of alcoholic drinks") +
    geom_smooth(method = "glm", 
                method.args = list(family = 'poisson'), 
                se = FALSE, 
                color = "black", 
                fullrange = TRUE) +
    geom_function(fun = function(x) exp(nb_b0 + nb_b1 * x),
                color = "#dc1e34", 
                size = 1)

5.15 Observed vs predicted

Code
ggplot(data = dat, aes(x = alcohol)) + 
    geom_bar(fill = "#3697aa", 
             color = "#76777a", 
             size = 1) +
labs(x = "Number of alcoholic drinks", y = "Frequency") +
ylim(0, 90)

5.16 Observed vs predicted

Code
library(pscl)       # predprob() function for predicted counts
                    # zeroinfl() function for zero-inflated count models
                    # hurdle() for hurdle count models
                    # vuong() for non-nested model comparisons
phat.poi <- predprob(m1)                  # predicted probability of each count, per person
phat.poi.mn <- apply(phat.poi, 2, mean)   # average over people
phat.nb <- predprob(m3)
phat.nb.mn <- apply(phat.nb, 2, mean)
x <- seq(0, 17, 1)
probs <- data.frame(x, phat.poi.mn, phat.nb.mn)
ggplot(data = dat, aes(x = alcohol)) + 
    geom_bar(fill = "#3697aa", 
             color = "#76777a", 
             size = 1) +
    geom_line(data = probs, 
              aes(x = x, y = 400*phat.poi.mn), 
              linetype = "solid",
              size = 1,
              color = "black") +
labs(x = "Number of alcoholic drinks", y = "Frequency") +
ylim(0, 90)

5.17 Observed vs predicted

Code
ggplot(data = dat, aes(x = alcohol)) + 
    geom_bar(fill = "#3697aa", 
             color = "#76777a", 
             size = 1) +
    geom_line(data = probs, 
              aes(x = x, y = 400*phat.poi.mn), 
              linetype = "solid",
              size = 1,
              color = "black") +
    geom_line(data = probs, 
              aes(x = x, y = 400*phat.nb.mn), 
              size = 1,
              #linetype = "dashed",
              color = "#dc1e34") +
labs(x = "Number of alcoholic drinks", y = "Frequency") +
ylim(0, 90)

5.18 Which model to pick?

  • Poisson regression
    • You will likely never satisfy the equidispersion assumption
  • Overdispersed Poisson regression
    • Preferred to Poisson regression
    • Appropriate standard errors and relatively simple model
  • Negative binomial regression
    • More complex, but better able to reproduce the observed counts
    • This may be preferred if you’re interested in prediction

5.19 A word of caution…

Berk & MacDonald (2008)

If apparent overdispersion results from specification errors in the systematic part of the Poisson regression model, resorting to the negative binomial distribution does not help. It can make things worse by giving a false sense of security when the fundamental errors in the model remain.

  • In other words, negative binomial regression doesn’t fix a bad model

6 Zeroes

6.1 Zeroes can be really important

  • Conceptually, zeroes are meaningful

  • Lowest possible value of a count

  • Indicate “nothing”

  • Two situations:

    • Too few zeroes (relative to a Poisson distribution)
    • Too many zeroes (relative to a Poisson distribution)

6.2 Too few zeroes

  • Sometimes you have a study where the outcome is a count, but it cannot take on a value of 0
    • Study of medical visits
      • Have to visit the doctor to get involved in the study
    • Study of substance use
      • Only interested in substance users
  • A zero-truncated Poisson regression model uses a truncated Poisson distribution with the probability of zero removed
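
For intuition, the zero-truncated distribution just renormalizes the Poisson probabilities for \(k \geq 1\) by \(1 - P(X = 0)\) (a quick sketch with \(\lambda = 1.5\)):

lambda <- 1.5
k <- 1:5
# Truncated probabilities: drop P(X = 0), then rescale so they sum to 1
dpois(k, lambda) / (1 - dpois(0, lambda))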

6.3 Excess zeroes

  • Counts often display “excess” zeroes
    • More values of 0 than expected for a Poisson distribution
  • Even if the rest of the distribution is approximately Poisson, the “excess” zeroes lead to overdispersion
    • Sometimes, what looks like overdispersion is really excess zeroes
  • There are specific Poisson family models to appropriately deal with excess zeroes, depending on why the zeroes are there

6.4 Why are there all these zeroes?

  • This is a substantive question
    • You need to know about the outcome you’re studying
  • Question: Do the people who are responding zero have some probability of responding otherwise?

6.5 Some probability of responding > 0?

  • Yes \(\rightarrow\) zero-inflated Poisson regression
    • Cigarettes smoked for smoker who hasn’t smoked yet today
    • Alcohol consumed for someone who hasn’t had any yet today
  • No \(\rightarrow\) hurdle regression or with-zeroes regression
    • Cigarettes smoked: nonsmoker always responds 0
    • Alcoholic drinks consumed: abstainer always responds 0
    • (These are called “structural zeroes”)

6.6 Zero-inflated Poisson regression

  • The zeroes have some probability of being non-zero
  • Two parts modeled simultaneously:
    • Logistic regression: Is each zero structural or not?
    • Poisson regression: non-structural zeroes and positive values
  • Can use same set of predictors in both parts, but do not have to
  • Can also model non-structural zeroes and positive values using overdispersed Poisson regression or negative binomial regression
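
A minimal sketch of fitting this model with zeroinfl() from the pscl package (loaded again below); the model name and the choice of sensationC for both parts are illustrative, and the | in the formula separates the count part from the zero part:

library(pscl)
# Count part: Poisson on sensationC; zero part: logistic on sensationC
zip1 <- zeroinfl(alcohol ~ sensationC | sensationC, data = dat, dist = "poisson")
summary(zip1)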

6.7 Hurdle regression (with-zeroes)

  • There are zeroes that have no probability of being non-zero
    • Two different populations: smokers and nonsmokers, drinkers and nondrinkers
  • Two parts modeled simultaneously:
    • Logistic regression: whether zero or not
    • Truncated Poisson regression: positive values only
  • Can use same set of predictors in both parts, but do not have to
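
A parallel sketch with hurdle() from pscl (logistic regression for zero vs. not zero, truncated Poisson for the positive counts); again the predictor choice is illustrative:

library(pscl)
hm1 <- hurdle(alcohol ~ sensationC | sensationC, data = dat, dist = "poisson")
summary(hm1)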

7 Variable lengths of time

7.1 Variable lengths of time

  • Poisson regression and variants assume a fixed length of time
    • Number of aggressive acts committed by a child in 1 hour
    • Number of visits to doctor in past 6 months
    • Number of adverse events in a study
  • Often, we measure a count over some variable period of time

7.2 Variable lengths of time

  • Can extend Poisson type models to variable time periods
    • Include the natural log of the measurement interval as a predictor with its regression coefficient fixed at 1:

\[ln(\hat{\mu}) = ln(time) + b_0 + b_1X_1 + b_2X_2 + \cdots + b_pX_p\]

  • In software: include time (specifically, ln(time)) as an “offset”
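
A hedged sketch of the offset in glm(); the months variable here is hypothetical, standing in for each person's measurement interval (the example data have no such variable):

# Hypothetical 'months' column recording each person's observation window;
# offset(log(months)) enters the model with its coefficient fixed at 1
m_off <- glm(alcohol ~ sensationC + offset(log(months)),
             data = dat, family = poisson(link = "log"))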

8 Evaluation and comparison

8.1 Deviance

  • As with logistic regression
    • Estimated with maximum likelihood
    • Use deviance to estimate pseudo-\(R^2\)
  • Deviance is a function of the log-likelihood
    • How far the model is from a theoretically perfect (saturated) model
    • Relative measure, so it can only be interpreted by comparison to another model
  • Deviance usually = -2LL, but not for Poisson models

8.2 Pseudo \(R^2\)

\[R^2_{deviance} = 1 - \frac{deviance_{model}}{deviance_{intercept\ only}}\]

  • Deviance for your model is “Residual deviance” from output
  • Deviance for intercept model is “Null deviance” from output
  • Theoretically ranges from 0 to 1 (but not always in practice)
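
Computing it for m1 is a one-liner using the deviances stored in the model object:

# 1 - residual deviance / null deviance
1 - m1$deviance / m1$null.deviance    # 1 - 1151.7/1186.8 ≈ 0.03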

8.3 Compare models

  • Compare deviances of 2 nested models using likelihood ratio test
  • e.g., compare m1 (only sensationC) to m1add (sensationC and sex)
anova(m1, m1add, test = "LRT")
Analysis of Deviance Table

Model 1: alcohol ~ sensationC
Model 2: alcohol ~ sensationC + sex
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       398    1151.69                          
2       397     959.46  1   192.23 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.4 Nested models

  • Two models are nested if
    • All terms in one are included in the other
    • You can get from one to the other by fixing some parameters to a value (e.g., a coefficient to 0)
  • Why do we care?
    • Likelihood ratio tests can only be used to compare nested models
      • If your models are not nested, you cannot use the likelihood ratio tests to compare them
    • Instead, use AIC, BIC (smaller is better), Vuong test

8.5 Nested models

  1. Linear regression with predictor \(X_1\)
  2. Linear regression with predictors \(X_1\) and \(X_2\)
  • Model 1 is nested within model 2
    • If the coefficient for \(X_2\) = 0 then Model 2 becomes Model 1

8.6 Complicated nesting situations 1

  3. Poisson regression with predictors \(X_3\) and \(X_4\)
  4. Overdispersed Poisson regression w/ predictors \(X_3\) and \(X_4\)
  • Model 3 seems like it’s nested within model 4
    • But it’s not really
  • Both estimate the same number of parameters, so the comparison has df = 0

8.7 Complicated nesting situations 2

  5. Overdispersed Poisson regression w/ predictor \(X_5\)
  6. Overdispersed Poisson regression w/ predictors \(X_5\) and \(X_6\)
  • Model 5 is NOT nested within model 6

  • In addition to the extra predictor (\(X_6\)), the overdispersion parameter is not the same in these models

    • Model 5 has a \(\psi\) parameter based on only \(X_5\)
    • Model 6 has a different \(\psi\) parameter based on both \(X_5\) and \(X_6\)
    • Can’t get from Model 6 to Model 5 by setting coefficient to 0

8.8 Complicated nesting situations 3

  7. Overdispersed Poisson regression w/ predictors \(X_7\) and \(X_8\)
  8. Negative binomial regression w/ predictors \(X_7\) and \(X_8\)
  • Model 7 is NOT nested within model 8 (or vice versa)

  • The overdispersion parameters are different for the two models

    • Model 7 has a \(\psi\) parameter and estimates variance = \(\psi \mu\)
    • Model 8 has an \(\alpha\) parameter and estimates variance = \(\mu + \alpha \mu^2\)

8.9 Vuong test

  • For nested or non-nested models
  • vuong() function in pscl package
library(pscl)
vuong(m1, m3)
Vuong Non-Nested Hypothesis Test-Statistic: 
(test-statistic is asymptotically distributed N(0,1) under the
 null that the models are indistinguishible)
-------------------------------------------------------------
              Vuong z-statistic             H_A    p-value
Raw                   -6.911618 model2 > model1 2.3958e-12
AIC-corrected         -6.911618 model2 > model1 2.3958e-12
BIC-corrected         -6.911618 model2 > model1 2.3958e-12

9 Summary

9.1 Summary

  • When your outcome is a count, use a model from the Poisson regression family
    • Nonlinear, exponential model or log-linear model
    • Overdispersion: overdispersed Poisson or negative binomial
    • Too many or too few zeroes
    • Variable lengths of time

9.2 In class

  • Run some Poisson regression models
    • Poisson
    • Overdispersed Poisson
    • Negative binomial
  • Interpret the results

9.3 Next time

  • Logistic regression for binary outcomes
  • Poisson regression for count outcomes
  • Survival analysis for survival outcomes