Checking model assumptions

Model fit

Always check a model’s fit

Check for outliers (model-based and non-model-based)
Check whether model assumptions (e.g., homoscedasticity) are satisfied
Evaluate overall measures of fit such as \(R^2_{mult}\)
Evaluate case-specific measures of fit

Outlier statistics

There are three types of outlier statistics to examine

Leverage: outliers in predictor space
Discrepancy: differences between observed and predicted values
Influence: actually changing predicted values or regression coefficients

Cases with high leverage and high discrepancy have the potential to change results

Cases with low leverage and/or low discrepancy don’t really

Leverage: hat matrix

Leverage measure comes from the hat matrix

Hat matrix: \(X_A (X'_A X_A)^{-1} X'_A\)
\(\hat Y = X_A (X'_A X_A)^{-1} X'_A Y\): “puts the hats on the Ys”
How far is a case’s predictor value(s) from the center of the points?
\(h_{ii}\) = leverage value for case i

Predicted values

Predicted value: outcome values the model predicts for each case

Also called “model-based” or “model-implied”

\(R^2_{mult}\) is the correlation between observed outcome values and predicted outcome values

Discrepancy: Residuals

Raw residuals are the difference between the observed outcome value and the predicted outcome value

\(e_i = Y_{i} - \hat Y_i\)

where \(Y_{i}\) is case i’s observed outcome

and \(\hat Y_i\) is case i’s predicted outcome

Hard to interpret because the metric depends on the metric of the outcome variable

Standardizing residuals helps interpretation

Discrepancy: Residuals (standardized)

There are two main ways to standardize residuals

Internally standardized (or studentized) residuals are the residual divided by the standard deviation for the residual

internally studentized residual \(= e_i / SD_{e_i}\)

where \(SD_{e_i} = \sqrt{ MS_{error}(1 - h_{ii})}\)

Discrepancy: Residuals (standardized)

Externally standardized (or deleted studentized) residuals are the residual (with predicted value from a model with the point deleted) divided by the standard deviation for the residual (from a model with the point deleted)

externally studentized residual \(= d_i / SD_d\)

where the subscript \((i)\) refers to the value from a model with case i deleted

and \(d_i = Y_i - \hat Y_{i(i)}\)

and \(SD_d = \sqrt{ MS_{error(i)}(1 - h_{ii})}\)

Use the rstudent() function to calculate

Influence: Cook’s D

How much the case is changing the predicted value

Several possible cut-off values

3*mean(Cook’s d)
4 / n

Influence: DFFITS

How much the case is changing the predicted value

Cut-offs: > 1 for small to medium samples, \(2\sqrt{(k + 1)/n}\) for large samples

Use the dffits() function to calculate

Influence: DFBETAS

How much the case is changing the regression coefficients

One value per case, per regression coefficient (including intercept)

Cut-offs: > \(\pm\) 1 for small to medium samples, \(\pm 2/\sqrt{n}\) for large samples

Use the dfbetas() function to calculate, but best to calculate separately for each predictor to get the datasets to merge smoothly

Some plotting stuff

broom() package

tidy(): put the regression coefficients, standard errors, p-values, etc. into a data frame

glance(): put the global model fit statistics - \(R^2\), AIC into a data frame

augment(): put the predictor, outcome, predicted values, residuals, etc. into a data frame

Only required argument is the model object

Some plots to consider

QQ plot: assess normality

Residuals (raw) as proxy for conditional normality

Index plot: easily spot extreme values

Leverage, (standardized) residuals, Cook’s d, DFFITS, DFBETAS

Scatterplot: look for relationships (with smooth line)

X = predictor or predicted value, Y = residual (should be no relationship with constant variance)

Residuals versus predicted value

data(FirstYearGPA)
#glimpse(FirstYearGPA)
model2 <- lm(GPA ~ SATV, data = FirstYearGPA)

model2_res <- augment(model2)

res_vs_fit <- ggplot(data = model2_res, 
                     aes(x = .fitted, y = .resid)) +
  labs(x = "Predicted GPA", y = "Raw residual") +
  geom_point() +
  geom_smooth(method = loess, se = TRUE)

Residuals versus predicted value

res_vs_fit

## `geom_smooth()` using formula 'y ~ x'

Fancy plot of residuals

Plot of vertical residual line

fancy_res <- ggplot(data = model2_res, 
                    aes(x = SATV, y = GPA)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, 
              color = "lightgrey") +
  geom_segment(aes(xend = SATV, yend = .fitted), 
               alpha = .2) +
  geom_point(aes(y = .fitted), shape = 1)

Fancy plot of residuals

fancy_res

## `geom_smooth()` using formula 'y ~ x'

Statistical Graphics and Communication

Stefany Coxe

October 24, 2019

Checking model assumptions

Model fit

Outlier statistics

Leverage: hat matrix

Predicted values

Discrepancy: Residuals

Discrepancy: Residuals (standardized)

Discrepancy: Residuals (standardized)

Influence: Cook’s D

Influence: DFFITS

Influence: DFBETAS

Some plotting stuff

broom() package

Some plots to consider

Residuals versus predicted value

Residuals versus predicted value

Fancy plot of residuals

Fancy plot of residuals