BTS 510 Lab 6

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_classic(base_size = 16))

1 Learning objectives

  • Understand probability as a measure of uncertainty
  • Summarize probabilities with probability distributions
  • Describe the difference between population and sample
  • Think about uncertainty in sampling
  • Summarize uncertainty in sampling: Sampling distribution

2 Mathematically-derived sampling distribution

  • Many combinations of population distribution and test statistic have mathematically-derived sampling distributions
    • For example, with a normally distributed population (\mathcal{N}(\mu, \sigma^2)), the sampling distribution of the mean is \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
      • In this case, the population distribution and sampling distribution are the same distribution: Normal
      • The variance (or standard deviation) of the sampling distribution decreases as n increases: A larger sample leads to a smaller, more precise standard error
Note
  • There is not a mathematically-derived sampling distribution for every population distribution and test statistic
    • If you’re dealing with unknown or unusual distributions or test statistics, your only option may be computational methods

3 Computationally-derived sampling distribution

3.1 Getting started

  • Set a random number seed
    • If you don’t do this, every time you run the code, you’ll get different numbers
    • Setting a random seed makes your simulation replicable
    • (I normally set a seed at the start of each document, but I moved it down here because it’s very important in this lab.)
  • Load (and install, if needed) the infer package
    • infer has a function that we’ll use to repeatedly sample
    • There are other ways to do this, but this is very simple and works well
  • Only have to do this once per document
set.seed(12345)
library(infer)

3.2 Standard Normal population: population \sim \mathcal{N}(0, 1)

3.2.1 Create a population

  • rnorm() creates a random variable that follows a normal distribution
    • First argument = Number of observations (population size)
    • Second argument = \mu
    • Third argument = \sigma (not \sigma^2)
norm_pop <- rnorm(1000000, 0, 1)
mean(norm_pop)
[1] 0.0002941277
var(norm_pop)
[1] 1.002632

3.2.2 Sample from the population

  • Small sample: n = 10
    • rep_sample_n() repeatedly samples from the dataset
      • Piped (%>%) from the norm_pop dataset, so it resamples from that
      • size = 10: Sample size for each sample
      • reps = 10000: Number of repeated samples (could be larger but this is good enough to show how things work and doesn’t take too long to run)
      • replace = TRUE: Sample with replacement
    • summarise(x_bar = mean(norm_pop)): Create a data frame with each of the means from each sample
norm_means10 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)

  • Moderate sample: n = 50
norm_means50 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5) 

  • Large sample: n = 100
norm_means100 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)

Note
  • Note that I adjusted the X axis to be the same for all three plots using xlim() so that you can easily compare the distributions
  • You’ll want to do the same in your later plots too

3.2.3 Compare estimates

  • Compare means and variances for each sample size
mean(norm_means10$x_bar)
[1] -0.003408017
mean(norm_means50$x_bar)
[1] -0.0009473133
mean(norm_means100$x_bar)
[1] 0.001145927
var(norm_means10$x_bar)
[1] 0.09805018
var(norm_means50$x_bar)
[1] 0.02006937
var(norm_means100$x_bar)
[1] 0.01009301
  • How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution?
    • \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})

3.3 Bernoulli population: population \sim B(0.3)

3.3.1 Create a population

  • rbinom() creates a random variable that follows a binomial distribution
    • First argument = Number of observations (population size)
    • Second argument = number of trials (here, just 1 for a Bernoulli trial)
    • Third argument = p (probability of success)
bernoulli_pop <- rbinom(1000000, 1, 0.3)
mean(bernoulli_pop)
[1] 0.300871
var(bernoulli_pop)
[1] 0.2103479

3.3.2 Sample from the population

  • Small sample: n = 10
bern_means10 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Moderate sample: n = 50
bern_means50 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Large sample: n = 100
bern_means100 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.3.3 Compare estimates

  • Compare means and variances for each sample size
mean(bern_means10$x_bar)
[1] 0.29996
mean(bern_means50$x_bar)
[1] 0.300286
mean(bern_means100$x_bar)
[1] 0.301151
var(bern_means10$x_bar)
[1] 0.02168417
var(bern_means50$x_bar)
[1] 0.004174456
var(bern_means100$x_bar)
[1] 0.002125338

3.4 Binomial population: population \sim Bin(50, 0.3)

3.4.1 Create a population

  • rbinom() creates a random variable that follows a binomial distribution
    • First argument = Number of observations (population size)
    • Second argument = m (number of trials)
    • Third argument = p (probability of success)
binomial_pop <- rbinom(1000000, 50, 0.3)
mean(binomial_pop)
[1] 15.00374
var(binomial_pop)
[1] 10.50284

3.4.2 Sample from the population

  • Small sample: n = 10
bin_means10 <- data.frame(binomial_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(binomial_pop))

ggplot(data = bin_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(10,20)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Moderate sample: n = 50
bin_means50 <- data.frame(binomial_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(binomial_pop))

ggplot(data = bin_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(10,20)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Large sample: n = 100
bin_means100 <- data.frame(binomial_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(binomial_pop))

ggplot(data = bin_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(10,20)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.4.3 Compare estimates

  • Compare means and variances for each sample size
mean(bin_means10$x_bar)
[1] 15.00175
mean(bin_means50$x_bar)
[1] 15.00399
mean(bin_means100$x_bar)
[1] 15.00949
var(bin_means10$x_bar)
[1] 1.038956
var(bin_means50$x_bar)
[1] 0.2092708
var(bin_means100$x_bar)
[1] 0.1037692

3.5 Poisson population: population \sim Poisson(1.5)

3.5.1 Create a population

  • rpois() creates a random variable that follows a Poisson distribution
    • First argument = Number of observations (population size)
    • Second argument = \lambda (mean and variance)
poisson_pop <- rpois(1000000, 1.5)
mean(poisson_pop)
[1] 1.501162
var(poisson_pop)
[1] 1.49775

3.5.2 Sample from the population

  • Small sample: n = 10
poi_means10 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = poi_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,5) 
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Moderate sample: n = 50
poi_means50 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = poi_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,5)  
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Large sample: n = 100
poi_means100 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = poi_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
    xlim(0,5)  
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.5.3 Compare estimates

  • Compare means and variances for each sample size
mean(poi_means10$x_bar)
[1] 1.49879
mean(poi_means50$x_bar)
[1] 1.500166
mean(poi_means100$x_bar)
[1] 1.501913
var(poi_means10$x_bar)
[1] 0.1520507
var(poi_means50$x_bar)
[1] 0.02967762
var(poi_means100$x_bar)
[1] 0.01483781