BTS 510 Lab 6

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

theme_set(theme_classic(base_size = 16))

1 Learning objectives

Understand probability as a measure of uncertainty
Summarize probabilities with probability distributions
Describe the difference between population and sample
Think about uncertainty in sampling
Summarize uncertainty in sampling: Sampling distribution

2 Mathematically-derived sampling distribution

Many combinations of population distribution and test statistic have mathematically-derived sampling distributions
- For example, with a normally distributed population (\mathcal{N}(\mu, \sigma^2)), the sampling distribution of the mean is \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
  - In this case, the population distribution and sampling distribution are the same distribution: Normal
  - The variance (or standard deviation) of the sampling distribution decreases as n increases: A larger sample leads to a smaller, more precise standard error

Note

There is not a mathematically-derived sampling distribution for every population distribution and test statistic
- If you’re dealing with unknown or unusual distributions or test statistics, your only option may be computational methods

3 Computationally-derived sampling distribution

3.1 Getting started

Set a random number seed
- If you don’t do this, every time you run the code, you’ll get different numbers
- Setting a random seed makes your simulation replicable
- (I normally set a seed at the start of each document, but I moved it down here because it’s very important in this lab.)
Load (and install, if needed) the infer package
- infer has a function that we’ll use to repeatedly sample
- There are other ways to do this, but this is very simple and works well
Only have to do this once per document

set.seed(12345)
library(infer)

3.2 Standard Normal population: population \sim \mathcal{N}(0, 1)

3.2.1 Create a population

rnorm() creates a random variable that follows a normal distribution
- First argument = Number of observations (population size)
- Second argument = \mu
- Third argument = \sigma (not \sigma^2)

norm_pop <- rnorm(1000000, 0, 1)
mean(norm_pop)

[1] 0.0002941277

var(norm_pop)

[1] 1.002632

3.2.2 Sample from the population

Small sample: n = 10
- rep_sample_n() repeatedly samples from the dataset
  - Piped (%>%) from the norm_pop dataset, so it resamples from that
  - size = 10: Sample size for each sample
  - reps = 10000: Number of repeated samples (could be larger but this is good enough to show how things work and doesn’t take too long to run)
  - replace = TRUE: Sample with replacement
- summarise(x_bar = mean(norm_pop)): Create a data frame with each of the means from each sample

norm_means10 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)

Moderate sample: n = 50

norm_means50 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)

Large sample: n = 100

norm_means100 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)

Note

Note that I adjusted the X axis to be the same for all three plots using xlim() so that you can easily compare the distributions
You’ll want to do the same in your later plots too

3.2.3 Compare estimates

Compare means and variances for each sample size

mean(norm_means10$x_bar)

[1] -0.003408017

mean(norm_means50$x_bar)

[1] -0.0009473133

mean(norm_means100$x_bar)

[1] 0.001145927

var(norm_means10$x_bar)

[1] 0.09805018

var(norm_means50$x_bar)

[1] 0.02006937

var(norm_means100$x_bar)

[1] 0.01009301

How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution?
- \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})

3.3 Bernoulli population: population \sim B(0.3)

3.3.1 Create a population

rbinom() creates a random variable that follows a binomial distribution
- First argument = Number of observations (population size)
- Second argument = number of trials (here, just 1 for a Bernoulli trial)
- Third argument = p (probability of success)

bernoulli_pop <- rbinom(1000000, 1, 0.3)
mean(bernoulli_pop)

[1] 0.300871

var(bernoulli_pop)

[1] 0.2103479

3.3.2 Sample from the population

Small sample: n = 10

bern_means10 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,1)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

Moderate sample: n = 50

bern_means50 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,1)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

Large sample: n = 100

bern_means100 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,1)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.3.3 Compare estimates

Compare means and variances for each sample size

mean(bern_means10$x_bar)

[1] 0.29996

mean(bern_means50$x_bar)

[1] 0.300286

mean(bern_means100$x_bar)

[1] 0.301151

var(bern_means10$x_bar)

[1] 0.02168417

var(bern_means50$x_bar)

[1] 0.004174456

var(bern_means100$x_bar)

[1] 0.002125338

3.4 Binomial population: population \sim Bin(50, 0.3)

3.4.1 Create a population

rbinom() creates a random variable that follows a binomial distribution
- First argument = Number of observations (population size)
- Second argument = m (number of trials)
- Third argument = p (probability of success)

binomial_pop <- rbinom(1000000, 50, 0.3)
mean(binomial_pop)

[1] 15.00374

var(binomial_pop)

[1] 10.50284

3.4.2 Sample from the population

Small sample: n = 10

bin_means10 <- data.frame(binomial_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(binomial_pop))

ggplot(data = bin_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(10,20)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

Moderate sample: n = 50

bin_means50 <- data.frame(binomial_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(binomial_pop))

ggplot(data = bin_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(10,20)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

Large sample: n = 100

bin_means100 <- data.frame(binomial_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(binomial_pop))

ggplot(data = bin_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(10,20)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.4.3 Compare estimates

Compare means and variances for each sample size

mean(bin_means10$x_bar)

[1] 15.00175

mean(bin_means50$x_bar)

[1] 15.00399

mean(bin_means100$x_bar)

[1] 15.00949

var(bin_means10$x_bar)

[1] 1.038956

var(bin_means50$x_bar)

[1] 0.2092708

var(bin_means100$x_bar)

[1] 0.1037692

3.5 Poisson population: population \sim Poisson(1.5)

3.5.1 Create a population

rpois() creates a random variable that follows a Poisson distribution
- First argument = Number of observations (population size)
- Second argument = \lambda (mean and variance)

poisson_pop <- rpois(1000000, 1.5)
mean(poisson_pop)

[1] 1.501162

var(poisson_pop)

[1] 1.49775

3.5.2 Sample from the population

Small sample: n = 10

poi_means10 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = poi_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,5)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

Moderate sample: n = 50

poi_means50 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = poi_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,5)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

Large sample: n = 100

poi_means100 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = poi_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,5)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.5.3 Compare estimates

Compare means and variances for each sample size

mean(poi_means10$x_bar)

[1] 1.49879

mean(poi_means50$x_bar)

[1] 1.500166

mean(poi_means100$x_bar)

[1] 1.501913

var(poi_means10$x_bar)

[1] 0.1520507

var(poi_means50$x_bar)

[1] 0.02967762

var(poi_means100$x_bar)

[1] 0.01483781