BTS 510 Lab 3

set.seed(12345)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(Stat2Data)
theme_set(theme_classic(base_size = 16))

Note

- tidyverse is loaded here mainly for ggplot2
- The data come from the Stat2Data package, which I’ve loaded here
- theme_set(theme_classic(base_size = 16)) makes the plots a little simpler and their fonts a bit bigger; this is purely my preference

1 Learning objectives

- Understand the different types of variables that can be measured
- Think about variables in terms of central tendency or location
- Think about variables in terms of dispersion or spread
- Start to think about variable distributions and plots

2 Read in data

Pulse dataset from the Stat2Data package

A dataset with 232 observations on the following 7 variables.
- Active: Pulse rate (beats per minute) after exercise
- Rest: Resting pulse rate (beats per minute)
- Smoke: 1=smoker or 0=nonsmoker
- Sex: 1=female or 0=male
- Exercise: Typical hours of exercise (per week)
- Hgt: Height (in inches)
- Wgt: Weight (in pounds)

data(Pulse)
head(Pulse)

  Active Rest Smoke Sex Exercise Hgt Wgt
1     97   78     0   1        1  63 119
2     82   68     1   0        3  70 225
3     88   62     0   0        3  72 175
4    106   74     0   0        3  72 170
5     78   63     0   1        3  67 125
6    109   65     0   0        3  74 188

str(Pulse)

'data.frame':   232 obs. of  7 variables:
 $ Active  : int  97 82 88 106 78 109 66 68 100 70 ...
 $ Rest    : int  78 68 62 74 63 65 43 65 63 59 ...
 $ Smoke   : int  0 1 0 0 0 0 0 0 0 0 ...
 $ Sex     : int  1 0 0 0 1 0 1 0 0 1 ...
 $ Exercise: int  1 3 3 3 3 3 3 3 1 2 ...
 $ Hgt     : int  63 70 72 72 67 74 67 70 70 65 ...
 $ Wgt     : int  119 225 175 170 125 188 140 200 165 115 ...

Note

Notice that every variable is stored as type int in the data frame. That is fine for the summaries in this lab, but you can “coerce” variables into other types, for example turning Smoke and Sex into factor variables.
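
As a minimal sketch of that coercion, the code below rebuilds the Smoke and Sex columns from the six rows printed by head(Pulse) and converts them to labelled factors. The names pulse_head, Smoke_f, and Sex_f are made up for this illustration, and the labels follow the variable descriptions listed earlier. With the package loaded, the same factor() calls should work on Pulse$Smoke and Pulse$Sex.

```r
# Smoke and Sex for the first six rows, copied from the head(Pulse) output
pulse_head <- data.frame(
  Smoke = c(0, 1, 0, 0, 0, 0),
  Sex   = c(1, 0, 0, 0, 1, 0)
)

# Coerce the 0/1 codes into factors with readable labels
# (Smoke_f and Sex_f are hypothetical column names)
pulse_head$Smoke_f <- factor(pulse_head$Smoke, levels = c(0, 1),
                             labels = c("nonsmoker", "smoker"))
pulse_head$Sex_f <- factor(pulse_head$Sex, levels = c(0, 1),
                           labels = c("male", "female"))

str(pulse_head)            # the new columns are reported as Factor
table(pulse_head$Smoke_f)  # counts per category instead of numeric summaries
```

Keeping the original integer columns next to the factor versions lets you still compute a proportion with mean() while using the factor for tables and plots.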

What type (nominal, ordinal, interval, ratio) is each variable?
- Think about the criteria: category vs number, ordered vs not, meaningful 0
- Are some ambiguous?

3 Central tendency

Based on the variable types you decided on above, estimate the appropriate measure(s) of central tendency for each variable

Continuous(ish) variables

mean(Pulse$Active)

[1] 91.29741

median(Pulse$Active)

[1] 88.5

mean(Pulse$Rest)

[1] 68.34914

median(Pulse$Rest)

[1] 68

mean(Pulse$Exercise)

[1] 2.25431

median(Pulse$Exercise)

[1] 2

mean(Pulse$Hgt)

[1] 68.24569

median(Pulse$Hgt)

[1] 68

mean(Pulse$Wgt)

[1] 157.9181

median(Pulse$Wgt)

[1] 150

Binary variables

mean(Pulse$Smoke)

[1] 0.112069

median(Pulse$Smoke)

[1] 0

mean(Pulse$Sex)

[1] 0.4741379

median(Pulse$Sex)

[1] 0
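
One way to read the binary results: the mean of a 0/1 variable is the proportion of 1s, so 0.112069 means about 11.2% of the sample are smokers and 0.4741379 means about 47.4% are female. The sketch below checks that idea on the six rows printed by head(Pulse). It also includes stat_mode(), a hypothetical helper (base R has no built-in function for the statistical mode), in case you decided that Smoke and Sex are nominal.

```r
# Smoke and Sex codes for the first six rows, copied from head(Pulse)
smoke <- c(0, 1, 0, 0, 0, 0)
sex   <- c(1, 0, 0, 0, 1, 0)

# The mean of a 0/1 variable equals the proportion of 1s
mean(smoke)
prop.table(table(smoke))

# Hypothetical helper: the most frequent value, returned as a character
# string (if there is a tie, the first value in sorted order wins)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

stat_mode(sex)
```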

4 Dispersion

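Based on the variable types you decided on above, estimate the appropriate measure(s) of dispersion for each variable
- The variables Smoke and Sex are a little tricky: they are 0/1 codes, so R will compute percentiles for them, but the results may not be very informative

Percentiles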

quantile(Pulse$Active, c(0.25, 0.75))

25% 75% 
 79 102

quantile(Pulse$Rest, c(0.25, 0.75))

25% 75% 
 62  74

quantile(Pulse$Smoke, c(0.25, 0.75))

25% 75% 
  0   0

quantile(Pulse$Sex, c(0.25, 0.75))

25% 75% 
  0   1

quantile(Pulse$Exercise, c(0.25, 0.75))

25% 75% 
  2   3

quantile(Pulse$Hgt, c(0.25, 0.75))

25% 75% 
 65  71

quantile(Pulse$Wgt, c(0.25, 0.75))

25% 75% 
135 175
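
The 25th and 75th percentiles are often combined into a single number, the interquartile range (IQR), which is their difference. For Active in the output above, that is 102 - 79 = 23. The sketch below shows the calculation two ways on the six Active values printed by head(Pulse); the object names are made up for this illustration.

```r
# Active for the first six rows, copied from head(Pulse)
active <- c(97, 82, 88, 106, 78, 109)

# IQR by hand: 75th percentile minus 25th percentile
q <- quantile(active, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])
iqr_by_hand

# Built-in equivalent (uses the same default quantile method)
IQR(active)

# summary() reports the quartiles alongside the mean and median
summary(active)
```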

Standard deviation

sd(Pulse$Active)

[1] 18.82023

sd(Pulse$Rest)

[1] 9.949378

sd(Pulse$Exercise)

[1] 0.7385363

sd(Pulse$Hgt)

[1] 3.738761

sd(Pulse$Wgt)

[1] 31.83259
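
To make the sd() results less of a black box, the sketch below computes the sample standard deviation by hand (with the n - 1 denominator that sd() uses) on the six Rest values printed by head(Pulse). It also illustrates why a standard deviation adds little for a 0/1 variable such as Smoke: it is completely determined by the proportion of 1s. The object names are made up for this illustration.

```r
# Rest and Smoke for the first six rows, copied from head(Pulse)
rest  <- c(78, 68, 62, 74, 63, 65)
smoke <- c(0, 1, 0, 0, 0, 0)

# Sample standard deviation by hand: square root of the sum of squared
# deviations from the mean, divided by n - 1
n <- length(rest)
sd_by_hand <- sqrt(sum((rest - mean(rest))^2) / (n - 1))
sd_by_hand
sd(rest)

# For a 0/1 variable, the sample SD is a function of the proportion p alone
p <- mean(smoke)
n_smoke <- length(smoke)
sd_from_p <- sqrt(p * (1 - p) * n_smoke / (n_smoke - 1))
sd_from_p
sd(smoke)
```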

5 Plots

Based on the plots, which measures of central tendency and dispersion should you present for each variable?
- Do the measures you estimated above do a good job of representing each variable?

Note

The code for each plot is included below. Look at it if you’d like, but don’t panic if you don’t understand it yet.

Active

# Histogram of Active; solid blue line = mean, dashed red line = median
ggplot(data = Pulse, aes(x = Active)) +
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Active), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Active), color = "red", linewidth = 1.5, linetype = "dashed")

Rest

ggplot(data = Pulse, aes(x = Rest)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Rest), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Rest), color = "red", linewidth = 1.5, linetype = "dashed")

Smoke

ggplot(data = Pulse, aes(x = Smoke)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Smoke), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Smoke), color = "red", linewidth = 1.5, linetype = "dashed")

Sex

ggplot(data = Pulse, aes(x = Sex)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Sex), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Sex), color = "red", linewidth = 1.5, linetype = "dashed")
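
If you decided that Smoke and Sex are categorical, a bar chart of counts may be an easier picture to read than a 30-bin histogram. This is a sketch of that alternative, not part of the original lab: it uses the six Sex values printed by head(Pulse), and the names sex_head, Sex_f, and p_bar are made up for this illustration.

```r
library(ggplot2)

# Sex for the first six rows, copied from head(Pulse)
sex_head <- data.frame(Sex = c(1, 0, 0, 0, 1, 0))

# Label the 0/1 codes so the axis shows categories rather than numbers
sex_head$Sex_f <- factor(sex_head$Sex, levels = c(0, 1),
                         labels = c("male", "female"))

# One bar per category; bar height is the number of observations
p_bar <- ggplot(data = sex_head, aes(x = Sex_f)) +
  geom_bar(color = "grey80") +
  labs(x = "Sex", y = "Count")

p_bar
```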

Exercise

ggplot(data = Pulse, aes(x = Exercise)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Exercise), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Exercise), color = "red", linewidth = 1.5, linetype = "dashed")

Hgt

ggplot(data = Pulse, aes(x = Hgt)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Hgt), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Hgt), color = "red", linewidth = 1.5, linetype = "dashed")

Wgt

ggplot(data = Pulse, aes(x = Wgt)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Wgt), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Wgt), color = "red", linewidth = 1.5, linetype = "dashed")
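
The seven plots above differ only in the variable name, so the pattern can be wrapped in a function. The sketch below is one way to do that; plot_center() is a hypothetical helper, not something from ggplot2 or Stat2Data, and it is demonstrated on the six Active values printed by head(Pulse) so it runs without the package.

```r
library(ggplot2)

# Hypothetical helper: histogram of one column with the mean (solid blue)
# and the median (dashed red) marked, mirroring the plots above
plot_center <- function(data, var, bins = 30) {
  x <- data[[var]]
  ggplot(data = data, aes(x = .data[[var]])) +
    geom_histogram(bins = bins, color = "grey80") +
    geom_vline(xintercept = mean(x), color = "blue", linewidth = 1.5) +
    geom_vline(xintercept = median(x), color = "red", linewidth = 1.5,
               linetype = "dashed") +
    labs(x = var)
}

# Active for the first six rows, copied from head(Pulse)
pulse_head <- data.frame(Active = c(97, 82, 88, 106, 78, 109))
p <- plot_center(pulse_head, "Active", bins = 5)
p

# With Stat2Data loaded, the same call should work on the full data,
# for example: plot_center(Pulse, "Wgt")
```

Passing the column name as a string and looking it up with .data[[var]] keeps the function usable in a loop over several variable names.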