Checking model assumptions

Model fit

Always check a model’s fit

  • Check for outliers (model-based and non-model-based)

  • Check whether model assumptions (e.g., homoscedasticity) are satisfied

  • Evaluate overall measures of fit such as \(R^2_{mult}\)

  • Evaluate case-specific measures of fit

Outlier statistics

There are three types of outlier statistics to examine

  • Leverage: outliers in predictor space

  • Discrepancy: differences between observed and predicted values

  • Influence: actually changing predicted values or regression coefficients

Cases with high leverage and high discrepancy have the potential to change results

  • Cases with low leverage and/or low discrepancy generally can't change results much

Leverage: hat matrix

The leverage measure comes from the hat matrix

  • Hat matrix: \(X_A (X'_A X_A)^{-1} X'_A\)

  • \(\hat Y = X_A (X'_A X_A)^{-1} X'_A Y\): “puts the hats on the Ys”

  • How far is a case’s predictor value(s) from the center of the points?

  • \(h_{ii}\) = leverage value for case i
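As a quick sketch (not from the original slides), leverage values can be pulled with hatvalues(); the mtcars fit below is purely illustrative.

# Illustrative fit on the built-in mtcars data
m <- lm(mpg ~ wt + hp, data = mtcars)

# h_ii for each case: the diagonal of the hat matrix
h <- hatvalues(m)

# Same values computed directly from X(X'X)^{-1}X'
X <- model.matrix(m)
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(as.numeric(h), as.numeric(diag(H)))  # TRUE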

Predicted values

Predicted value: outcome values the model predicts for each case

Also called “model-based” or “model-implied”

\(R^2_{mult}\) is the squared correlation between observed outcome values and predicted outcome values
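A minimal check of that identity, reusing the illustrative mtcars fit from above:

m <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative fit
y_hat <- fitted(m)                     # model-implied outcome values
cor(mtcars$mpg, y_hat)^2               # squared correlation of observed vs. predicted...
summary(m)$r.squared                   # ...matches the model's R^2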

Discrepancy: Residuals

Raw residuals are the difference between the observed outcome value and the predicted outcome value

\(e_i = Y_{i} - \hat Y_i\)

where \(Y_{i}\) is case i’s observed outcome

and \(\hat Y_i\) is case i’s predicted outcome

Raw residuals are hard to interpret because their metric depends on the metric of the outcome variable

  • Standardizing residuals helps interpretation
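In R, raw residuals come straight from resid(); a one-line sketch with the same illustrative fit:

e <- resid(m)  # raw residuals e_i for the illustrative mtcars fit
all.equal(as.numeric(e), mtcars$mpg - as.numeric(fitted(m)))  # TRUE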

Discrepancy: Residuals (standardized)

There are two main ways to standardize residuals

Internally standardized (or studentized) residuals are the residual divided by the standard deviation of the residual

  • internally studentized residual \(= e_i / SD_{e_i}\)

where \(SD_{e_i} = \sqrt{ MS_{error}(1 - h_{ii})}\)

Discrepancy: Residuals (standardized)

Externally standardized (or deleted studentized) residuals are the residual (with the predicted value taken from a model fit with the case deleted) divided by the standard deviation of the residual (also from the model with the case deleted)

  • externally studentized residual \(= d_i / SD_{d_i}\)

where the subscript \((i)\) refers to the value from a model with case i deleted

and \(d_i = Y_i - \hat Y_{i(i)}\)

and \(SD_{d_i} = \sqrt{ MS_{error(i)}(1 - h_{ii})}\)

  • Use the rstudent() function to calculate
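A brief sketch with the illustrative mtcars fit; the \(\pm 2\) flag below is just a common rule of thumb, not from the slides:

m <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative fit
d <- rstudent(m)     # externally studentized residuals
which(abs(d) > 2)    # flag cases beyond a common +/- 2 rule of thumb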

Influence: Cook’s D

How much the case is changing the predicted values, aggregated across all cases

Several possible cut-off values

  • 3*mean(Cook’s d)

  • 4 / n
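Base R computes these with cooks.distance(); a sketch applying both cut-offs to the illustrative mtcars fit:

m <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative fit
cd <- cooks.distance(m)         # one Cook's d per case
which(cd > 3 * mean(cd))        # cut-off 1: 3 * mean(Cook's d)
which(cd > 4 / nobs(m))         # cut-off 2: 4 / n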

Influence: DFFITS

How much the case is changing its own predicted value

Cut-offs: \(|DFFITS| > 1\) for small to medium samples, \(|DFFITS| > 2\sqrt{(k + 1)/n}\) for large samples

  • Use the dffits() function to calculate
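A sketch with the illustrative mtcars fit, applying the large-sample cut-off:

m <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative fit
df_fits <- dffits(m)
k <- length(coef(m)) - 1        # number of predictors
n <- nobs(m)
which(abs(df_fits) > 2 * sqrt((k + 1) / n))  # large-sample cut-off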

Influence: DFBETAS

How much the case is changing the regression coefficients

One value per case, per regression coefficient (including intercept)

Cut-offs: \(|DFBETAS| > 1\) for small to medium samples, \(|DFBETAS| > 2/\sqrt{n}\) for large samples

  • Use the dfbetas() function to calculate; since it returns one column per coefficient, it's often easiest to handle each predictor's column separately so the datasets merge smoothly
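A sketch of the matrix dfbetas() returns, with the large-sample cut-off applied column by column (illustrative mtcars fit again):

m <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative fit
db <- dfbetas(m)                # one column per coefficient, incl. intercept
head(db)
apply(abs(db) > 2 / sqrt(nobs(m)), 2, which)  # flagged cases, per coefficient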

Some plotting stuff

broom package

tidy(): put the regression coefficients, standard errors, p-values, etc. into a data frame

glance(): put the global model fit statistics (\(R^2\), AIC, etc.) into a data frame

augment(): put the predictor, outcome, predicted values, residuals, etc. into a data frame

The only required argument for each function is the model object
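A minimal sketch of all three on an illustrative fit (the dotted column names, e.g. .fitted and .resid, are broom's):

library(broom)
m <- lm(mpg ~ wt + hp, data = mtcars)  # illustrative fit

tidy(m)     # coefficients, SEs, t statistics, p-values
glance(m)   # R^2, adjusted R^2, AIC, BIC, ...
augment(m)  # data plus .fitted, .resid, .hat, .cooksd, .std.resid, ...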

Some plots to consider

QQ plot: assess normality

  • Residuals (raw) as proxy for conditional normality

Index plot: easily spot extreme values

  • Leverage, (standardized) residuals, Cook’s d, DFFITS, DFBETAS

Scatterplot: look for relationships (with smooth line)

  • X = predictor or predicted value, Y = residual (should show no relationship, with constant variance); a quick sketch follows
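A sketch of the first two plot types on an illustrative fit; the scatterplot version appears on the next slides with the FirstYearGPA model.

library(ggplot2)
library(broom)
m_aug <- augment(lm(mpg ~ wt + hp, data = mtcars))  # illustrative fit

# QQ plot of raw residuals
ggplot(m_aug, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line()

# Index plot of Cook's d (stored by augment() as .cooksd)
ggplot(m_aug, aes(x = seq_along(.cooksd), y = .cooksd)) +
  geom_point() +
  labs(x = "Case number", y = "Cook's d")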

Residuals versus predicted value

library(Stat2Data)   # provides the FirstYearGPA data
library(broom)
library(ggplot2)

data(FirstYearGPA)
# glimpse(FirstYearGPA)
model2 <- lm(GPA ~ SATV, data = FirstYearGPA)

# Predictor, outcome, predicted values, residuals, etc. in one data frame
model2_res <- augment(model2)

# Raw residuals against predicted values, with a loess smooth
res_vs_fit <- ggplot(data = model2_res, 
                     aes(x = .fitted, y = .resid)) +
  labs(x = "Predicted GPA", y = "Raw residual") +
  geom_point() +
  geom_smooth(method = "loess", se = TRUE)

Residuals versus predicted value

res_vs_fit
## `geom_smooth()` using formula 'y ~ x'

Fancy plot of residuals

Plot with a vertical segment showing each case's residual

# Scatterplot with the fitted line, plus a segment connecting each
# observed point to its predicted value
fancy_res <- ggplot(data = model2_res, 
                    aes(x = SATV, y = GPA)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, 
              color = "lightgrey") +
  geom_segment(aes(xend = SATV, yend = .fitted), 
               alpha = .2) +
  geom_point(aes(y = .fitted), shape = 1)  # open circles mark predicted values

Fancy plot of residuals

fancy_res
## `geom_smooth()` using formula 'y ~ x'