set.seed(12345)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(Stat2Data)
theme_set(theme_classic(base_size = 16))
Note
  • tidyverse is loaded mainly for ggplot2
  • The data comes from the Stat2Data package, which I’ve loaded here
  • theme_set(theme_classic(base_size = 16)) simplifies the plots and enlarges their fonts; this is purely my preference

1 Learning objectives

  • Understand the different types of variables that can be measured
  • Think about variables in terms of central tendency or location
  • Think about variables in terms of dispersion or spread
  • Start to think about variable distributions and plots

2 Read in data

Pulse dataset from the Stat2Data package

  • A dataset with 232 observations on the following 7 variables.
    • Active: Pulse rate (beats per minute) after exercise
    • Rest: Resting pulse rate (beats per minute)
    • Smoke: 1=smoker or 0=nonsmoker
    • Sex: 1=female or 0=male
    • Exercise: Typical hours of exercise (per week)
    • Hgt: Height (in inches)
    • Wgt: Weight (in pounds)
data(Pulse)
head(Pulse)
  Active Rest Smoke Sex Exercise Hgt Wgt
1     97   78     0   1        1  63 119
2     82   68     1   0        3  70 225
3     88   62     0   0        3  72 175
4    106   74     0   0        3  72 170
5     78   63     0   1        3  67 125
6    109   65     0   0        3  74 188
str(Pulse)
'data.frame':   232 obs. of  7 variables:
 $ Active  : int  97 82 88 106 78 109 66 68 100 70 ...
 $ Rest    : int  78 68 62 74 63 65 43 65 63 59 ...
 $ Smoke   : int  0 1 0 0 0 0 0 0 0 0 ...
 $ Sex     : int  1 0 0 0 1 0 1 0 0 1 ...
 $ Exercise: int  1 3 3 3 3 3 3 3 1 2 ...
 $ Hgt     : int  63 70 72 72 67 74 67 70 70 65 ...
 $ Wgt     : int  119 225 175 170 125 188 140 200 165 115 ...
Note

Notice that every variable is stored as type int in the data frame. Integer storage works for the calculations in this document, but you can “coerce” variables to other types, for example converting Smoke and Sex to factor variables.

  • What type of variable (nominal, ordinal, interval, ratio) is each variable?
    • Think about the criteria: category vs number, ordered vs not, meaningful zero
    • Are some ambiguous?
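
The coercion mentioned in the note above can be sketched as follows. This is only an illustration: the data frame `pulse_head` is a hypothetical name, built from the six rows printed by head(Pulse) so the example runs without the Stat2Data package, and the factor labels are taken from the variable descriptions in this section. The same factor() calls should apply to the full Pulse data frame.

```r
# First six rows of Pulse, copied from the head(Pulse) output above,
# so this sketch runs without the Stat2Data package.
pulse_head <- data.frame(
  Active   = c(97L, 82L, 88L, 106L, 78L, 109L),
  Rest     = c(78L, 68L, 62L, 74L, 63L, 65L),
  Smoke    = c(0L, 1L, 0L, 0L, 0L, 0L),
  Sex      = c(1L, 0L, 0L, 0L, 1L, 0L),
  Exercise = c(1L, 3L, 3L, 3L, 3L, 3L),
  Hgt      = c(63L, 70L, 72L, 72L, 67L, 74L),
  Wgt      = c(119L, 225L, 175L, 170L, 125L, 188L)
)

# Coerce the 0/1 codes to labelled factors (labels follow the codebook above).
pulse_head$Smoke <- factor(pulse_head$Smoke, levels = c(0, 1),
                           labels = c("nonsmoker", "smoker"))
pulse_head$Sex   <- factor(pulse_head$Sex, levels = c(0, 1),
                           labels = c("male", "female"))

str(pulse_head[, c("Smoke", "Sex")])  # both columns are now factors
table(pulse_head$Smoke)               # counts per category
```

Once a variable is a factor, mean() and median() no longer apply to it, which is a useful reminder that those summaries assume numeric data.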

3 Central tendency

  • Based on the variable types you decided on above, estimate the appropriate measure(s) of central tendency for each variable

  • Continuous(ish) variables

mean(Pulse$Active)
[1] 91.29741
median(Pulse$Active)
[1] 88.5
mean(Pulse$Rest)
[1] 68.34914
median(Pulse$Rest)
[1] 68
mean(Pulse$Exercise)
[1] 2.25431
median(Pulse$Exercise)
[1] 2
mean(Pulse$Hgt)
[1] 68.24569
median(Pulse$Hgt)
[1] 68
mean(Pulse$Wgt)
[1] 157.9181
median(Pulse$Wgt)
[1] 150
  • Binary variables
mean(Pulse$Smoke)
[1] 0.112069
median(Pulse$Smoke)
[1] 0
mean(Pulse$Sex)
[1] 0.4741379
median(Pulse$Sex)
[1] 0
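
For a 0/1 variable, the mean is the proportion of observations coded 1, and the median is simply whichever code covers more than half the sample. A small sketch of that idea, using only the Smoke and Sex values from the six rows printed by head(Pulse) (the vector names `smoke` and `sex` are illustrative, and the results will differ from the full-data values above):

```r
# Smoke and Sex values from the six rows shown by head(Pulse) above.
smoke <- c(0, 1, 0, 0, 0, 0)
sex   <- c(1, 0, 0, 0, 1, 0)

# The mean of a 0/1 variable is the proportion of 1s.
mean(smoke)        # 1 of the 6 rows is a smoker
mean(smoke == 1)   # the same value, computed explicitly as a proportion

# Counts and proportions per category.
table(sex)
prop.table(table(sex))

# Mode: the most frequent category (with ties, which.max returns the first).
sex_mode <- names(which.max(table(sex)))
sex_mode
```

A frequency table or proportion is often the more natural summary for these variables, since a "mean sex" has no direct interpretation beyond the share of one category.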

4 Dispersion

  • Based on the variable types you decided on above, estimate the appropriate measure(s) of dispersion for each variable
    • The variables Smoke and Sex are a little tricky
  • Percentiles
quantile(Pulse$Active, c(0.25, 0.75))
25% 75% 
 79 102 
quantile(Pulse$Rest, c(0.25, 0.75))
25% 75% 
 62  74 
quantile(Pulse$Smoke, c(0.25, 0.75))
25% 75% 
  0   0 
quantile(Pulse$Sex, c(0.25, 0.75))
25% 75% 
  0   1 
quantile(Pulse$Exercise, c(0.25, 0.75))
25% 75% 
  2   3 
quantile(Pulse$Hgt, c(0.25, 0.75))
25% 75% 
 65  71 
quantile(Pulse$Wgt, c(0.25, 0.75))
25% 75% 
135 175 
  • Standard deviation
sd(Pulse$Active)
[1] 18.82023
sd(Pulse$Rest)
[1] 9.949378
sd(Pulse$Exercise)
[1] 0.7385363
sd(Pulse$Hgt)
[1] 3.738761
sd(Pulse$Wgt)
[1] 31.83259
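
Two points from this section can be checked with a short sketch. First, the 25th and 75th percentiles can be condensed into a single number, the interquartile range. Second, for a 0/1 variable the spread is fully determined by the proportion p of 1s: the population variance is p(1 - p), and R's var() returns that value scaled by n / (n - 1). The vectors below are the Active and Smoke values from the six rows printed by head(Pulse), used here only as a small illustration.

```r
# Active and Smoke values from the six rows shown by head(Pulse) above.
active <- c(97, 82, 88, 106, 78, 109)
smoke  <- c(0, 1, 0, 0, 0, 0)

# Interquartile range: 75th percentile minus 25th percentile.
q <- quantile(active, c(0.25, 0.75))
iqr_manual <- unname(q[2] - q[1])
IQR(active)                       # built-in equivalent of iqr_manual

# Binary variable: the spread depends only on the proportion of 1s.
p <- mean(smoke)
n <- length(smoke)
pop_var  <- p * (1 - p)           # population variance of a 0/1 variable
samp_var <- var(smoke)            # sample variance (n - 1 denominator)
all.equal(samp_var, pop_var * n / (n - 1))
```

This is one reason Smoke and Sex are tricky: reporting the proportion already tells the reader everything a standard deviation would.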

5 Plots

  • Based on the plots, what should you present for each variable?
    • Does what you presented do a good job of representing each variable?
Note

The plotting code is hidden by default. You can look at it if you’d like, but don’t panic if you don’t understand it.

  • Active
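# Histogram of Active with the mean (solid blue line) and median (dashed red line);
# the same pattern is repeated for each variable below
ggplot(data = Pulse, aes(x = Active)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Active), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Active), color = "red", linewidth = 1.5, linetype = "dashed")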

  • Rest
ggplot(data = Pulse, aes(x = Rest)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Rest), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Rest), color = "red", linewidth = 1.5, linetype = "dashed")

  • Smoke
ggplot(data = Pulse, aes(x = Smoke)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Smoke), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Smoke), color = "red", linewidth = 1.5, linetype = "dashed")

  • Sex
ggplot(data = Pulse, aes(x = Sex)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Sex), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Sex), color = "red", linewidth = 1.5, linetype = "dashed")

  • Exercise
ggplot(data = Pulse, aes(x = Exercise)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Exercise), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Exercise), color = "red", linewidth = 1.5, linetype = "dashed")

  • Hgt
ggplot(data = Pulse, aes(x = Hgt)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Hgt), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Hgt), color = "red", linewidth = 1.5, linetype = "dashed")

  • Wgt
ggplot(data = Pulse, aes(x = Wgt)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Wgt), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Wgt), color = "red", linewidth = 1.5, linetype = "dashed")
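
For the two 0/1 variables, a bar chart of counts per category may be easier to read than a 30-bin histogram. The following is a sketch of that alternative, not the approach used above: `pulse_head` and `smoke_bar` are hypothetical names, the data are the Smoke values from the six rows printed by head(Pulse), and the labels come from the variable descriptions earlier in this document.

```r
library(ggplot2)

# Smoke values from the six rows shown by head(Pulse); with the full data,
# pass Pulse instead of this small data frame.
pulse_head <- data.frame(Smoke = c(0, 1, 0, 0, 0, 0))

# Treat the 0/1 code as a labelled category and draw one bar per category.
smoke_bar <- ggplot(
  pulse_head,
  aes(x = factor(Smoke, levels = c(0, 1), labels = c("nonsmoker", "smoker")))
) +
  geom_bar(fill = "grey80", color = "black") +
  labs(x = "Smoke", y = "Count")

smoke_bar
```

The bar heights are the same counts that table() would report, so the plot and the numeric summary tell the same story.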