Always check a model’s fit
Check for outliers (model-based and non-model-based)
Check whether model assumptions (e.g., homoscedasticity) are satisfied
Evaluate overall measures of fit such as \(R^2_{mult}\)
Evaluate case-specific measures of fit
There are three types of outlier statistics to examine
Leverage: outliers in predictor space
Discrepancy: differences between observed and predicted values
Influence: actually changing predicted values or regression coefficients
Cases with high leverage and high discrepancy have the potential to change results
Leverage measure comes from the hat matrix
Hat matrix: \(X_A (X'_A X_A)^{-1} X'_A\)
\(\hat Y = X_A (X'_A X_A)^{-1} X'_A Y\): “puts the hats on the Ys”
How far is a case’s predictor value(s) from the center of the points?
\(h_{ii}\) = leverage value for case i
Predicted value: outcome values the model predicts for each case
Also called “model-based” or “model-implied”
\(R^2_{mult}\) is the correlation between observed outcome values and predicted outcome values
Raw residuals are the difference between the observed outcome value and the predicted outcome value
\(e_i = Y_{i} - \hat Y_i\)
where \(Y_{i}\) is case i’s observed outcome
and \(\hat Y_i\) is case i’s predicted outcome
Hard to interpret because the metric depends on the metric of the outcome variable
There are two main ways to standardize residuals
Internally standardized (or studentized) residuals are the residual divided by the standard deviation for the residual
where \(SD_{e_i} = \sqrt{ MS_{error}(1 - h_{ii})}\)
Externally standardized (or deleted studentized) residuals are the residual (with predicted value from a model with the point deleted) divided by the standard deviation for the residual (from a model with the point deleted)
where the subscript \((i)\) refers to the value from a model with case i deleted
and \(d_i = Y_i - \hat Y_{i(i)}\)
and \(SD_d = \sqrt{ MS_{error(i)}(1 - h_{ii})}\)
rstudent()
function to calculateHow much the case is changing the predicted value
Several possible cut-off values
3*mean(Cook’s d)
4 / n
How much the case is changing the predicted value
Cut-offs: > 1 for small to medium samples, \(2\sqrt{(k + 1)/n}\) for large samples
dffits()
function to calculateHow much the case is changing the regression coefficients
One value per case, per regression coefficient (including intercept)
Cut-offs: > \(\pm\) 1 for small to medium samples, \(\pm 2/\sqrt{n}\) for large samples
dfbetas()
function to calculate, but best to
calculate separately for each predictor to get the datasets to merge
smoothlytidy()
: put the regression coefficients, standard
errors, p-values, etc. into a data frame
glance()
: put the global model fit statistics - \(R^2\), AIC into a data frame
augment()
: put the predictor, outcome, predicted values,
residuals, etc. into a data frame
Only required argument is the model object
QQ plot: assess normality
Index plot: easily spot extreme values
Scatterplot: look for relationships (with smooth line)
data(FirstYearGPA)
#glimpse(FirstYearGPA)
model2 <- lm(GPA ~ SATV, data = FirstYearGPA)
model2_res <- augment(model2)
res_vs_fit <- ggplot(data = model2_res,
aes(x = .fitted, y = .resid)) +
labs(x = "Predicted GPA", y = "Raw residual") +
geom_point() +
geom_smooth(method = loess, se = TRUE)
res_vs_fit
## `geom_smooth()` using formula 'y ~ x'
Plot of vertical residual line
fancy_res <- ggplot(data = model2_res,
aes(x = SATV, y = GPA)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE,
color = "lightgrey") +
geom_segment(aes(xend = SATV, yend = .fitted),
alpha = .2) +
geom_point(aes(y = .fitted), shape = 1)
fancy_res
## `geom_smooth()` using formula 'y ~ x'