BTS 510 Lab 8

set.seed(12345)
library(tidyverse)

Warning: package 'purrr' was built under R version 4.5.1

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(Stat2Data)
theme_set(theme_classic(base_size = 16))

1 Learning objectives

- Assess models for (multi)collinearity among predictors
- Conduct outlier analyses to identify extreme and/or problematic cases

2 Data

FirstYearGPA data from the Stat2Data package: n = 219 subjects
- GPA: First-year college GPA on a 0.0 to 4.0 scale
- HSGPA: High school GPA on a 0.0 to 4.0 scale
- SATV: Verbal/critical reading SAT score
- SATM: Math SAT score
- Male: 1 = male, 0 = female
- HU: Number of credit hours earned in humanities courses in high school
- SS: Number of credit hours earned in social science courses in high school
- FirstGen: 1 = student is the first in her or his family to attend college, 0 = otherwise
- White: 1 = white students, 0 = others
- CollegeBound: 1 = attended a high school where at least 50% of students intended to go on to college, 0 = otherwise
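As a quick check on the description above, the data can be loaded and inspected before modeling. This is a minimal sketch that assumes the Stat2Data package is installed; the missing-value check is an extra step, not something the lab asks for.

```r
# Load the FirstYearGPA data shipped with the Stat2Data package
data("FirstYearGPA", package = "Stat2Data")

# Confirm the number of subjects and variables
dim(FirstYearGPA)

# Inspect variable names and types
str(FirstYearGPA)

# Count missing values per variable before fitting any model
colSums(is.na(FirstYearGPA))
```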

3 Research question(s)

How do all of these variables relate to first-year GPA (GPA)?
- Are there any problems with collinearity among the predictors?
- Are there problematic cases that are influencing the results?

4 Tasks

Run a linear regression model predicting GPA from all other variables in the dataset.
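One way to fit this model is sketched below. The object name `fit` is only an illustrative choice, and the `GPA ~ .` shorthand assumes the data frame contains only the outcome and the nine predictors listed in the Data section.

```r
data("FirstYearGPA", package = "Stat2Data")

# GPA ~ . uses every other column in the data frame as a predictor
fit <- lm(GPA ~ ., data = FirstYearGPA)

# Coefficients, standard errors, t tests, and R-squared
summary(fit)
```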
Some of these predictors seem like they could be strongly related to one another, which could cause problems for the model. Check whether collinearity is an issue for this model by looking at:

- Correlations among variables
- Variance inflation factors (VIFs)
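Both checks can be sketched in base R. The VIF for predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all of the other predictors. The object names `X` and `vifs` are illustrative. If the car package is installed, `car::vif()` applied to the fitted model should give the same values for a model like this one, where every term has one degree of freedom.

```r
data("FirstYearGPA", package = "Stat2Data")

# Pairwise correlations among all variables (outcome and predictors)
round(cor(FirstYearGPA), 2)

# Predictors only (drop the outcome column)
X <- FirstYearGPA[, setdiff(names(FirstYearGPA), "GPA")]

# VIF for each predictor: regress it on the remaining predictors,
# then compute 1 / (1 - R^2) from that auxiliary regression
vifs <- sapply(names(X), function(v) {
  aux_formula <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(aux_formula, data = X))$r.squared
  1 / (1 - r2)
})

round(vifs, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others. Cutoffs for "too large" vary by source (5 and 10 are both commonly quoted), so treat any threshold as a rule of thumb.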

Based on the correlations between each predictor and the outcome (above), are there predictors that you think should be significant but aren’t? Why do you think they aren’t? (This is a bit of a philosophical question.)
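To make this comparison concrete, each predictor's correlation with GPA can be placed next to its p-value from the full model. This is only a sketch to support the discussion; the names `fit`, `predictors`, and `comparison` are illustrative.

```r
data("FirstYearGPA", package = "Stat2Data")
fit <- lm(GPA ~ ., data = FirstYearGPA)

predictors <- setdiff(names(FirstYearGPA), "GPA")

# Marginal (one-at-a-time) correlation of each predictor with GPA
r_with_gpa <- cor(FirstYearGPA[, predictors], FirstYearGPA$GPA)[, 1]

# p-value for each predictor's partial slope, controlling for the others
p_in_model <- summary(fit)$coefficients[predictors, "Pr(>|t|)"]

comparison <- data.frame(
  r_with_gpa = round(r_with_gpa, 2),
  p_in_model = round(p_in_model, 3)
)
comparison
```

The correlation describes a predictor on its own, while the regression test describes its unique contribution after accounting for the other predictors, so the two can disagree when predictors overlap.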
Are there extreme cases that we might be concerned about? Check for extreme values in terms of:

- Predictors
- Predicted values
- Changes to predicted values
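The three checks above can be sketched with base R diagnostics. The mapping used here is an assumption about what the lab intends: leverage (hat values) for extremeness on the predictors, studentized deleted residuals for extremeness relative to the predicted values, and DFFITS and Cook's distance for changes to the predicted values. The cutoffs are common rules of thumb, not values given in the lab.

```r
data("FirstYearGPA", package = "Stat2Data")
fit <- lm(GPA ~ ., data = FirstYearGPA)

n <- nobs(fit)          # number of cases used in the model
k <- length(coef(fit))  # number of estimated coefficients, including the intercept

diagnostics <- data.frame(
  case     = names(hatvalues(fit)),
  leverage = hatvalues(fit),       # extremeness on the predictors
  rstudent = rstudent(fit),        # studentized deleted residuals
  dffits   = dffits(fit),          # change in a case's own predicted value when it is deleted
  cooks_d  = cooks.distance(fit)   # overall change in predicted values when it is deleted
)

# Rule-of-thumb cutoffs (assumed here; conventions differ across textbooks)
flagged <- subset(
  diagnostics,
  leverage > 2 * k / n | abs(rstudent) > 3 | abs(dffits) > 2 * sqrt(k / n)
)

# Cases exceeding at least one cutoff, most influential first
flagged[order(-flagged$cooks_d), ]
```

Plotting each diagnostic against the case number is often more informative than a fixed cutoff, because it shows whether a flagged case stands apart from the rest of the data.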

Summarize your findings about collinearity and outliers. Use plain language.