BTS 510 Lab 3

set.seed(12345)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(Stat2Data)
theme_set(theme_classic(base_size = 16))

Note

tidyverse is loaded here really just for ggplot2
The data comes from the Stat2Data package, which I’ve loaded here
theme_set(theme_classic(base_size = 16)) makes the plots a little simpler and makes their fonts a bit bigger – this is purely my preference

1 Learning objectives

Understand the different types of variables that can be measured
Think about variables in terms of central tendency or location
Think about variables in terms of dispersion or spread
Start to think about variable distributions and plots

2 Read in data

Pulse dataset from the Stat2Data package

A dataset with 232 observations on the following 7 variables.
- Active: Pulse rate (beats per minute) after exercise
- Rest: Resting pulse rate (beats per minute)
- Smoke: 1=smoker or 0=nonsmoker
- Sex: 1=female or 0=male
- Exercise: Typical hours of exercise (per week)
- Hgt: Height (in inches)
- Wgt: Weight (in pounds)

data(Pulse)
head(Pulse)

  Active Rest Smoke Sex Exercise Hgt Wgt
1     97   78     0   1        1  63 119
2     82   68     1   0        3  70 225
3     88   62     0   0        3  72 175
4    106   74     0   0        3  72 170
5     78   63     0   1        3  67 125
6    109   65     0   0        3  74 188

str(Pulse)

'data.frame':   232 obs. of  7 variables:
 $ Active  : int  97 82 88 106 78 109 66 68 100 70 ...
 $ Rest    : int  78 68 62 74 63 65 43 65 63 59 ...
 $ Smoke   : int  0 1 0 0 0 0 0 0 0 0 ...
 $ Sex     : int  1 0 0 0 1 0 1 0 0 1 ...
 $ Exercise: int  1 3 3 3 3 3 3 3 1 2 ...
 $ Hgt     : int  63 70 72 72 67 74 67 70 70 65 ...
 $ Wgt     : int  119 225 175 170 125 188 140 200 165 115 ...

Note

Notice that all of the variables are type int in the dataframe. This probably works in this situation, but you can “coerce” the variables into different types, such as Smoke and Sex into factor variables.
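As a minimal sketch of that coercion, the code below rebuilds the six rows printed by head(Pulse) above (so it runs without Stat2Data installed) and converts Smoke and Sex to factors. The object name pulse6 and the label text are my own choices for illustration, not part of the package; with Stat2Data loaded, the same factor() calls should work on Pulse itself.

```r
# First six rows of Pulse, copied from the head(Pulse) output above,
# so this sketch runs without the Stat2Data package (pulse6 is a made-up name)
pulse6 <- data.frame(
  Active   = c(97L, 82L, 88L, 106L, 78L, 109L),
  Rest     = c(78L, 68L, 62L, 74L, 63L, 65L),
  Smoke    = c(0L, 1L, 0L, 0L, 0L, 0L),
  Sex      = c(1L, 0L, 0L, 0L, 1L, 0L),
  Exercise = c(1L, 3L, 3L, 3L, 3L, 3L),
  Hgt      = c(63L, 70L, 72L, 72L, 67L, 74L),
  Wgt      = c(119L, 225L, 175L, 170L, 125L, 188L)
)

# Coerce the 0/1 integer codes into labelled factors
# (labels follow the variable descriptions: 1 = smoker, 1 = female)
pulse6$Smoke <- factor(pulse6$Smoke, levels = c(0, 1),
                       labels = c("nonsmoker", "smoker"))
pulse6$Sex   <- factor(pulse6$Sex, levels = c(0, 1),
                       labels = c("male", "female"))

# Check the new types and count each category
str(pulse6[, c("Smoke", "Sex")])
table(pulse6$Smoke)
```

Once a variable is a factor, R treats it as categories rather than numbers, so functions such as mean() no longer return a numeric summary for it.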

What type of variable (nominal, ordinal, interval, ratio) is each variable?
- Think about the criteria: category vs number, ordered vs not, meaningful 0
- Are some ambiguous?

3 Central tendency

Based on the variable types you decided on above, estimate the appropriate measure(s) of central tendency for each variable
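If it helps to see the mechanics, here is a rough sketch of the usual R calls, using only the six Rest and Smoke values shown in the head(Pulse) output. The helper stat_mode() is hypothetical (base R has no built-in function for the statistical mode), and which measure is appropriate for which variable is still up to you.

```r
# Values copied from the first six rows of head(Pulse) above
rest  <- c(78, 68, 62, 74, 63, 65)
smoke <- c(0, 1, 0, 0, 0, 0)

# Mean and median for a numeric variable
mean(rest)
median(rest)

# Base R's mode() reports the storage type, not the most common value,
# so this hypothetical helper returns the most frequent value(s) as text
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

stat_mode(smoke)

# For a 0/1 variable, the mean is the proportion coded 1
mean(smoke)
```

To run this on the full dataset, replace rest and smoke with Pulse$Rest and Pulse$Smoke.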

4 Dispersion

Based on the variable types you decided on above, estimate the appropriate measure(s) of dispersion for each variable
- The variables Smoke and Sex are a little tricky
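A hedged sketch of the common dispersion functions, again on the six Rest and Smoke values from the head(Pulse) output rather than the full data. The last lines illustrate one reason 0/1 variables are tricky: their variance is fixed by the proportion of 1s, so it adds little beyond the proportion itself.

```r
# Values copied from the first six rows of head(Pulse) above
rest  <- c(78, 68, 62, 74, 63, 65)
smoke <- c(0, 1, 0, 0, 0, 0)

# Common measures of spread for a numeric variable
range(rest)  # minimum and maximum
IQR(rest)    # interquartile range
var(rest)    # sample variance (n - 1 denominator)
sd(rest)     # sample standard deviation

# For a 0/1 variable, spread is determined by the proportion p of 1s
p <- mean(smoke)
p * (1 - p)  # variance with an n denominator
var(smoke)   # same quantity rescaled by n / (n - 1)

# A frequency table is often a more readable summary for categories
table(smoke)
```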

5 Plots

Based on the plots, what should you present for each variable?
- Does what you presented do a good job representing each variable?

Note

Code is folded by default. You can expand it if you’d like, but don’t panic if you don’t understand it. In each plot, the solid blue line marks the mean and the dashed red line marks the median.

Active

Code

ggplot(data = Pulse, aes(x = Active)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Active), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Active), color = "red", linewidth = 1.5, linetype = "dashed")

Rest

Code

ggplot(data = Pulse, aes(x = Rest)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Rest), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Rest), color = "red", linewidth = 1.5, linetype = "dashed")

Smoke

Code

ggplot(data = Pulse, aes(x = Smoke)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Smoke), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Smoke), color = "red", linewidth = 1.5, linetype = "dashed")

Sex

Code

ggplot(data = Pulse, aes(x = Sex)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Sex), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Sex), color = "red", linewidth = 1.5, linetype = "dashed")

Exercise

Code

ggplot(data = Pulse, aes(x = Exercise)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Exercise), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Exercise), color = "red", linewidth = 1.5, linetype = "dashed")

Hgt

Code

ggplot(data = Pulse, aes(x = Hgt)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Hgt), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Hgt), color = "red", linewidth = 1.5, linetype = "dashed")

Wgt

Code

ggplot(data = Pulse, aes(x = Wgt)) + 
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Wgt), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Wgt), color = "red", linewidth = 1.5, linetype = "dashed")
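The histograms above treat every variable as a number. If you decided that a variable such as Smoke is nominal, one possible alternative is a bar chart of counts per category. This is only a sketch under that assumption: it uses the six Smoke values from the head(Pulse) output, and the data frame name smoke_df and the labels are made up for illustration.

```r
library(ggplot2)

# Smoke values from the first six rows of head(Pulse), stored as a factor
# (smoke_df and the labels are hypothetical names chosen for this sketch)
smoke_df <- data.frame(
  Smoke = factor(c(0, 1, 0, 0, 0, 0), levels = c(0, 1),
                 labels = c("nonsmoker", "smoker"))
)

# geom_bar() counts the rows in each category, so no binning is needed
p <- ggplot(data = smoke_df, aes(x = Smoke)) +
  geom_bar(color = "grey80")
p
```

Compare this with the Smoke histogram: with a factor on the x-axis there are no empty bins between 0 and 1, and no mean or median line to interpret.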
