BTS 510 Lab 3

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_classic(base_size = 16))

1 Learning objectives

  • Understand probability as a measure of uncertainty
  • Summarize probabilities with probability distributions
  • Think about uncertainty in sampling
  • Summarize uncertainty in sampling: Sampling distribution

2 Mathematically-derived sampling distribution

  • Many combinations of population distribution and test statistic have mathematically-derived sampling distributions
    • For example, with a normally distributed population (\mathcal{N}(\mu, \sigma^2)), the sampling distribution of the mean is \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
      • In this case, the population distribution and sampling distribution are the same distribution: Normal
      • The variance (and standard deviation) of the sampling distribution decreases as n increases: A larger sample gives a smaller standard error, and thus a more precise estimate of the mean (see the quick check below)
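  • Quick check: with \sigma = 1, you can compute the theoretical standard error \frac{\sigma}{\sqrt{n}} directly for the sample sizes used later in this lab (a minimal sketch; the commented values are just \frac{1}{\sqrt{n}})
sigma <- 1
n <- c(10, 50, 100)
sigma / sqrt(n)  # theoretical SEs: 0.3162, 0.1414, 0.1000
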
Note
  • There is not a mathematically-derived sampling distribution for every population distribution and test statistic
    • If you’re dealing with unknown or unusual distributions or test statistics, your only option may be computational methods

3 Computationally-derived sampling distribution

3.1 Getting started

  • Set a random number seed
    • If you don’t do this, every time you run the code, you’ll get different numbers
    • Setting a random seed makes your simulation replicable
    • (I normally set a seed at the start of each document, but I moved it down here because it’s very important in this lab.)
  • Load (and install, if needed) the infer package
    • infer has a function that we’ll use to repeatedly sample
    • There are other ways to do this, but this is very simple and works well
  • You only have to do this once per document
set.seed(12345)
library(infer)

3.2 Standard Normal population: population \sim \mathcal{N}(0, 1)

3.2.1 Create a population

  • rnorm() creates a random variable that follows a normal distribution
    • First argument = Number of observations (population size)
    • Second argument = \mu
    • Third argument = \sigma (not \sigma^2)
norm_pop <- rnorm(1000000, 0, 1)
mean(norm_pop)
[1] 0.0002941277
sd(norm_pop)
[1] 1.001315
var(norm_pop)
[1] 1.002632

3.2.2 Sample from the population

  • Small sample: n = 10
    • rep_sample_n() repeatedly samples from the dataset
      • Piped (%>%) from a data frame built from norm_pop, so it samples from that population
      • size = 10: Sample size for each sample
      • reps = 10000: Number of repeated samples (could be larger but this is good enough to show how things work and doesn’t take too long to run)
      • replace = TRUE: Sample with replacement
    • summarise(x_bar = mean(norm_pop)): Create a data frame with one row per sample, containing each sample’s mean (x_bar)
    • The ggplot() code creates a histogram from the data
      • Modification: bins = 50 increases the number of bins to make the histogram a little smoother
      • Modification: xlim(-1.5, 1.5) sets limits on the x-axis to make the plots more comparable
      • Make similar modifications in your own plots later in this document
      • We’ll spend the next 2 weeks on ggplot() so you’ll learn all of this (and more!)
norm_means10 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%  # 10000 samples of n = 10
  summarise(x_bar = mean(norm_pop))                          # mean of each sample

ggplot(data = norm_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)
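
  • Optional check (not part of the lab code): overlay the theoretical \mathcal{N}(0, \frac{1}{10}) density (standard deviation \frac{1}{\sqrt{10}}) on the histogram to see how closely the simulation tracks the theory; a sketch that rescales the histogram to the density scale with after_stat()
ggplot(data = norm_means10, aes(x = x_bar)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1 / sqrt(10))) +
  xlim(-1.5, 1.5)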

  • Moderate sample: n = 50
norm_means50 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5) 

  • Large sample: n = 100
norm_means100 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)

3.2.3 Compare estimates

  • Compare means, standard deviations, and variances for each sample size
mean(norm_means10$x_bar)
[1] -0.003408017
mean(norm_means50$x_bar)
[1] -0.0009473133
mean(norm_means100$x_bar)
[1] 0.001145927
sd(norm_means10$x_bar)
[1] 0.3131297
sd(norm_means50$x_bar)
[1] 0.1416664
sd(norm_means100$x_bar)
[1] 0.100464
var(norm_means10$x_bar)
[1] 0.09805018
var(norm_means50$x_bar)
[1] 0.02006937
var(norm_means100$x_bar)
[1] 0.01009301
  • How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution?
    • \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
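  • One way to answer: compute the theoretical standard errors \frac{\sigma}{\sqrt{n}} (here \sigma = 1) and set them beside the simulated ones; a minimal sketch:
1 / sqrt(c(10, 50, 100))  # theoretical SEs: 0.3162, 0.1414, 0.1000
# simulated SEs above:                       0.3131, 0.1417, 0.1004
# and the simulated means all sit near the population mean, mu = 0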

3.3 Bernoulli population: population \sim Bernoulli(0.3)

3.3.1 Create a population

  • rbinom() creates a random variable that follows a binomial distribution
    • First argument = Number of observations (population size)
    • Second argument = number of trials (here, just 1 for a Bernoulli trial)
    • Third argument = p (probability of success)
bernoulli_pop <- rbinom(1000000, 1, 0.3)
mean(bernoulli_pop)
[1] 0.300871
sd(bernoulli_pop)
[1] 0.4586369
var(bernoulli_pop)
[1] 0.2103479
  • Copy and modify the code above to resample from this new Bernoulli population
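  • Quick check before you do: a Bernoulli(p) population has mean p and variance p(1 - p), so the estimates above should be close to these theoretical values (a minimal sketch)
p <- 0.3
p * (1 - p)        # theoretical variance: 0.21
sqrt(p * (1 - p))  # theoretical SD: ~0.4583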

3.3.2 Sample from the population

  • Small sample: n = 10
bern_means10 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Moderate sample: n = 50
bern_means50 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Large sample: n = 100
bern_means100 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.3.3 Compare estimates

  • Compare means, standard deviations, and variances for each sample size
mean(bern_means10$x_bar)
[1] 0.29996
mean(bern_means50$x_bar)
[1] 0.300286
mean(bern_means100$x_bar)
[1] 0.301151
sd(bern_means10$x_bar)
[1] 0.1472554
sd(bern_means50$x_bar)
[1] 0.06461003
sd(bern_means100$x_bar)
[1] 0.04610139
var(bern_means10$x_bar)
[1] 0.02168417
var(bern_means50$x_bar)
[1] 0.004174456
var(bern_means100$x_bar)
[1] 0.002125338
  • How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution for a normally distributed population?
    • \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
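  • One way to answer: plug \sigma^2 = p(1 - p) into the formula above, so the standard error is \sqrt{\frac{p(1 - p)}{n}}; a minimal sketch:
sqrt(0.3 * 0.7 / c(10, 50, 100))  # theoretical SEs: 0.1449, 0.0648, 0.0458
# simulated SEs above:                               0.1473, 0.0646, 0.0461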

3.4 Poisson population: population \sim Poisson(1.5)

3.4.1 Create a population

  • rpois() creates a random variable that follows a Poisson distribution
    • First argument = Number of observations (population size)
    • Second argument = \lambda (mean and variance)
      • Remember that \lambda is the variance; we’ll also look at the standard deviation, which is its square root: \sqrt{\lambda}
poisson_pop <- rpois(1000000, 1.5)
mean(poisson_pop)
[1] 1.501647
sd(poisson_pop)
[1] 1.226078
var(poisson_pop)
[1] 1.503267
  • Copy and modify the code above to resample from this new Poisson population
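  • Quick check before you do: a Poisson(\lambda) population has mean and variance both equal to \lambda, so its standard deviation is \sqrt{\lambda} (a minimal sketch; compare to the estimates above)
lambda <- 1.5
sqrt(lambda)  # theoretical SD: ~1.2247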

3.4.2 Sample from the population

  • Small sample: n = 10
pois_means10 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = pois_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 4)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Moderate sample: n = 50
pois_means50 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = pois_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 4)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Large sample: n = 100
pois_means100 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = pois_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 4)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.4.3 Compare estimates

  • Compare means, standard deviations, and variances for each sample size
mean(pois_means10$x_bar)
[1] 1.49989
mean(pois_means50$x_bar)
[1] 1.502572
mean(pois_means100$x_bar)
[1] 1.503961
sd(pois_means10$x_bar)
[1] 0.3863003
sd(pois_means50$x_bar)
[1] 0.1733869
sd(pois_means100$x_bar)
[1] 0.1214733
var(pois_means10$x_bar)
[1] 0.1492279
var(pois_means50$x_bar)
[1] 0.03006303
var(pois_means100$x_bar)
[1] 0.01475576
  • How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution for a normally distributed population?
    • \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
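  • One way to answer: plug \sigma^2 = \lambda into the formula above, so the standard error is \sqrt{\frac{\lambda}{n}}; a minimal sketch:
sqrt(1.5 / c(10, 50, 100))  # theoretical SEs: 0.3873, 0.1732, 0.1225
# simulated SEs above:                         0.3863, 0.1734, 0.1215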

4 Summarize results

  • In general, how does sample size impact the standard error?
    • i.e., the standard deviation of the sampling distribution?
  • Does it seem to matter what the population distribution is?
    • Based on: Visual inspection of the distribution
    • Based on: Estimates of standard error
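  • One way to organize your answer (a sketch, not required): tabulate the theoretical standard errors for all three populations and compare them to the simulated values above
n <- c(10, 50, 100)
data.frame(
  n         = n,
  normal    = 1 / sqrt(n),          # sigma = 1
  bernoulli = sqrt(0.3 * 0.7 / n),  # sigma = sqrt(p * (1 - p))
  poisson   = sqrt(1.5 / n)         # sigma = sqrt(lambda)
)
  • Whatever the population’s shape, the standard error shrinks in proportion to \frac{1}{\sqrt{n}}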