Notice that every variable in the dataframe is stored as type int. That works in this situation, but you can “coerce” variables into different types, for example converting Smoke and Sex into factor variables.
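As a rough illustration of that coercion, here is a minimal sketch. It uses a small made-up data frame called `toy` (hypothetical, not the real Pulse data) and assumes the 0/1 coding given in the dataset description (Smoke: 1 = smoker, 0 = nonsmoker; Sex: 1 = female, 0 = male).

```r
# Made-up stand-in for two Pulse columns (hypothetical values, not the real data)
toy <- data.frame(Smoke = c(0L, 1L, 0L, 0L),
                  Sex   = c(1L, 0L, 1L, 1L))

# factor() maps each integer code to a category label
toy$Smoke <- factor(toy$Smoke, levels = c(0, 1), labels = c("nonsmoker", "smoker"))
toy$Sex   <- factor(toy$Sex,   levels = c(0, 1), labels = c("male", "female"))

# Both columns are now stored as factors rather than int
str(toy)
```

Keep in mind that mean() returns NA (with a warning) when it is given a factor, so if you still want numeric summaries of the 0/1 codes, it may be easier to store the factor version in a new column instead of overwriting the original.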
What type of variable (nominal, ordinal, interval, ratio) is each variable?
Think about the criteria: category vs number, ordered vs not, meaningful 0
Are some ambiguous?
3 Central tendency
Based on the variable types you decided on above, estimate the appropriate measure(s) of central tendency for each variable
Continuous(ish) variables
mean(Pulse$Active)
[1] 91.29741
median(Pulse$Active)
[1] 88.5
mean(Pulse$Rest)
[1] 68.34914
median(Pulse$Rest)
[1] 68
mean(Pulse$Exercise)
[1] 2.25431
median(Pulse$Exercise)
[1] 2
mean(Pulse$Hgt)
[1] 68.24569
median(Pulse$Hgt)
[1] 68
mean(Pulse$Wgt)
[1] 157.9181
median(Pulse$Wgt)
[1] 150
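For Wgt, the mean (157.9181) sits noticeably above the median (150). One possible reading is that a few large values pull the mean upward. The sketch below illustrates that effect with a made-up vector `wgt` (hypothetical values, not the Pulse data).

```r
# Made-up weights (hypothetical values, not the real data)
wgt <- c(135, 140, 150, 155, 160)

mean(wgt)    # 148
median(wgt)  # 150

# Add one unusually large value
wgt_out <- c(wgt, 300)

mean(wgt_out)    # the mean moves a lot
median(wgt_out)  # the median barely moves
```

A gap between the mean and the median is only a hint about the shape of the distribution, so it is worth checking against a plot.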
Binary variables
mean(Pulse$Smoke)
[1] 0.112069
median(Pulse$Smoke)
[1] 0
mean(Pulse$Sex)
[1] 0.4741379
median(Pulse$Sex)
[1] 0
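For a variable coded 0/1, the mean is the proportion of 1s and the median is whichever category holds the middle observation. A frequency table shows the same information more directly. This is a minimal sketch using a made-up vector `smoke` (hypothetical, standing in for a column such as Pulse$Smoke).

```r
# Made-up 0/1 vector (hypothetical values, not the real data)
smoke <- c(0, 0, 1, 0, 0, 0, 1, 0)

counts <- table(smoke)        # how many observations fall in each category
props  <- prop.table(counts)  # the same counts expressed as proportions

# The mean of a 0/1 variable equals the proportion of 1s
mean(smoke)
as.numeric(props["1"])

# The mode is the most frequent category
mode_value <- names(counts)[which.max(counts)]
mode_value
```

Applied to the results above, the mean of Smoke (0.112069) can be read as roughly 11% of the sample being coded 1.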
4 Dispersion
Based on the variable types you decided on above, estimate the appropriate measure(s) of dispersion for each variable
The variables Smoke and Sex are a little tricky because they are 0/1 codes for categories rather than measured quantities
Percentiles
quantile(Pulse$Active, c(0.25, 0.75))
25% 75%
79 102
quantile(Pulse$Rest, c(0.25, 0.75))
25% 75%
62 74
quantile(Pulse$Smoke, c(0.25, 0.75))
25% 75%
0 0
quantile(Pulse$Sex, c(0.25, 0.75))
25% 75%
0 1
quantile(Pulse$Exercise, c(0.25, 0.75))
25% 75%
2 3
quantile(Pulse$Hgt, c(0.25, 0.75))
25% 75%
65 71
quantile(Pulse$Wgt, c(0.25, 0.75))
25% 75%
135 175
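The 25th and 75th percentiles can be combined into a single number, the interquartile range (IQR), which is the distance between them. This sketch uses a made-up vector `rest` (hypothetical values, not the Pulse data) to show that base R's IQR() matches the difference of the two quantiles.

```r
# Made-up resting pulse rates (hypothetical values, not the real data)
rest <- c(58, 62, 64, 68, 70, 74, 80)

# 25th and 75th percentiles, as in the calls above
q <- quantile(rest, c(0.25, 0.75))
q

# IQR as the difference between the two quartiles
iqr_by_hand <- unname(diff(q))
iqr_by_hand

# Base R shortcut for the same quantity
IQR(rest)
```

Using the output above for Active, the quartiles are 79 and 102, so the IQR is 102 - 79 = 23.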
Standard deviation
sd(Pulse$Active)
[1] 18.82023
sd(Pulse$Rest)
[1] 9.949378
sd(Pulse$Exercise)
[1] 0.7385363
sd(Pulse$Hgt)
[1] 3.738761
sd(Pulse$Wgt)
[1] 31.83259
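No standard deviation is computed above for Smoke and Sex. One reason a separate measure adds little for a 0/1 variable is that its spread is completely determined by the proportion of 1s, p: the variance is p(1 - p) when dividing by n. The sketch below checks this on a made-up vector `x` (hypothetical values); note that sd() divides by n - 1, so a small rescaling is needed for the two to agree.

```r
# Made-up 0/1 vector (hypothetical values, not the real data)
x <- c(0, 0, 1, 0, 0, 0, 1, 0)
n <- length(x)

p <- mean(x)                      # proportion of 1s
sd_from_p <- sqrt(p * (1 - p))    # spread implied by p alone (divides by n)

# sd() divides by n - 1, so rescale it to the divide-by-n version
sd_rescaled <- sd(x) * sqrt((n - 1) / n)

sd_from_p
sd_rescaled
```

Because the spread follows directly from p, reporting the proportion (or the counts) in each category is usually enough for a binary variable.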
5 Plots
Based on the plots, what should you present for each variable?
Does what you presented do a good job representing each variable?
Note
The code is hidden by default. You can look at it if you’d like, but don’t panic if you don’t understand it
Active
# Histogram of Active; solid blue line = mean, dashed red line = median
ggplot(data = Pulse, aes(x = Active)) +
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Active), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Active), color = "red", linewidth = 1.5, linetype = "dashed")
Rest
# Histogram of Rest; solid blue line = mean, dashed red line = median
ggplot(data = Pulse, aes(x = Rest)) +
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Rest), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Rest), color = "red", linewidth = 1.5, linetype = "dashed")
Smoke
# Histogram of Smoke; solid blue line = mean, dashed red line = median
ggplot(data = Pulse, aes(x = Smoke)) +
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Smoke), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Smoke), color = "red", linewidth = 1.5, linetype = "dashed")
Sex
# Histogram of Sex; solid blue line = mean, dashed red line = median
ggplot(data = Pulse, aes(x = Sex)) +
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Sex), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Sex), color = "red", linewidth = 1.5, linetype = "dashed")
Exercise
# Histogram of Exercise; solid blue line = mean, dashed red line = median
ggplot(data = Pulse, aes(x = Exercise)) +
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Exercise), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Exercise), color = "red", linewidth = 1.5, linetype = "dashed")
Hgt
# Histogram of Hgt; solid blue line = mean, dashed red line = median
ggplot(data = Pulse, aes(x = Hgt)) +
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Hgt), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Hgt), color = "red", linewidth = 1.5, linetype = "dashed")
Wgt
# Histogram of Wgt; solid blue line = mean, dashed red line = median
ggplot(data = Pulse, aes(x = Wgt)) +
  geom_histogram(bins = 30, color = "grey80") +
  geom_vline(xintercept = mean(Pulse$Wgt), color = "blue", linewidth = 1.5) +
  geom_vline(xintercept = median(Pulse$Wgt), color = "red", linewidth = 1.5, linetype = "dashed")
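The histograms for Smoke and Sex have only two occupied bars, which hints that a different display may suit binary variables better. One alternative is a bar chart of counts per category. This is only a sketch: the data frame `toy` and the column `Smoke_f` are made up for illustration, and the labels assume the coding 1 = smoker, 0 = nonsmoker from the dataset description.

```r
library(ggplot2)

# Made-up 0/1 column (hypothetical values, not the real data)
toy <- data.frame(Smoke = c(0, 0, 1, 0, 0, 0, 1, 0))

# Convert the codes to a labeled factor so the axis shows category names
toy$Smoke_f <- factor(toy$Smoke, levels = c(0, 1), labels = c("nonsmoker", "smoker"))

# geom_bar() counts the rows in each category and draws one bar per category
p <- ggplot(data = toy, aes(x = Smoke_f)) +
  geom_bar(color = "grey80") +
  labs(x = "Smoke", y = "Count")
p
```

A bar chart drops the mean and median lines on purpose: for a categorical variable, the height of each bar already carries the information those lines were trying to summarize.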