BTS 510 Lab 3

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_classic(base_size = 16))

1 Learning objectives

  • Understand probability as a measure of uncertainty
  • Summarize probabilities with probability distributions
  • Think about uncertainty in sampling
  • Summarize uncertainty in sampling: Sampling distribution

2 Mathematically-derived sampling distribution

  • Many combinations of population distribution and test statistic have mathematically-derived sampling distributions
    • For example, with a normally distributed population (\mathcal{N}(\mu, \sigma^2)), the sampling distribution of the mean is \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
      • In this case, the population distribution and sampling distribution are the same distribution: Normal
      • The variance (and standard deviation) of the sampling distribution decreases as n increases: A larger sample gives a smaller standard error, and thus a more precise estimate of the mean (see the quick check below)
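  • Quick check: with \sigma = 1, you can compute the theoretical standard error \frac{\sigma}{\sqrt{n}} directly for the sample sizes used later in this lab (a minimal sketch; the commented values are just \frac{1}{\sqrt{n}})
sigma <- 1
n <- c(10, 50, 100)
sigma / sqrt(n)  # theoretical SEs: 0.3162, 0.1414, 0.1000
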
Note
  • There is not a mathematically-derived sampling distribution for every population distribution and test statistic
    • If you’re dealing with unknown or unusual distributions or test statistics, your only option may be computational methods

3 Computationally-derived sampling distribution

3.1 Getting started

  • Set a random number seed
    • If you don’t do this, every time you run the code, you’ll get different numbers
    • Setting a random seed makes your simulation replicable
    • (I normally set a seed at the start of each document, but I moved it down here because it’s very important in this lab.)
  • Load (and install, if needed) the infer package
    • infer has a function that we’ll use to repeatedly sample
    • There are other ways to do this, but this is very simple and works well
  • You only have to do this once per document
set.seed(12345)
library(infer)

3.2 Standard Normal population: population \sim \mathcal{N}(0, 1)

3.2.1 Create a population

  • rnorm() creates a random variable that follows a normal distribution
    • First argument = Number of observations (population size)
    • Second argument = \mu
    • Third argument = \sigma (not \sigma^2)
norm_pop <- rnorm(1000000, 0, 1)
mean(norm_pop)
[1] 0.0002941277
sd(norm_pop)
[1] 1.001315
var(norm_pop)
[1] 1.002632

3.2.2 Sample from the population

  • Small sample: n = 10
    • rep_sample_n() repeatedly samples from the dataset
      • Piped (%>%) from a data frame built from norm_pop, so it samples from that population
      • size = 10: Sample size for each sample
      • reps = 10000: Number of repeated samples (could be larger but this is good enough to show how things work and doesn’t take too long to run)
      • replace = TRUE: Sample with replacement
    • summarise(x_bar = mean(norm_pop)): Create a data frame with one row per sample, containing each sample’s mean (x_bar)
    • The ggplot() code creates a histogram from the data
      • Modification: bins = 50 increases the number of bins to make the histogram a little smoother
      • Modification: xlim(-1.5, 1.5) sets limits on the x-axis to make the plots more comparable
      • Make similar modifications in your own plots later in this document
      • We’ll spend the next 2 weeks on ggplot() so you’ll learn all of this (and more!)
norm_means10 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%  # 10000 samples of n = 10
  summarise(x_bar = mean(norm_pop))                          # mean of each sample

ggplot(data = norm_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)
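
  • Optional check (not part of the lab code): overlay the theoretical \mathcal{N}(0, \frac{1}{10}) density (standard deviation \frac{1}{\sqrt{10}}) on the histogram to see how closely the simulation tracks the theory; a sketch that rescales the histogram to the density scale with after_stat()
ggplot(data = norm_means10, aes(x = x_bar)) +
  geom_histogram(aes(y = after_stat(density)), bins = 50) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1 / sqrt(10))) +
  xlim(-1.5, 1.5)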

  • Moderate sample: n = 50
norm_means50 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5) 

  • Large sample: n = 100
norm_means100 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))

ggplot(data = norm_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)

3.2.3 Compare estimates

  • Compare means, standard deviations, and variances for each sample size
mean(norm_means10$x_bar)
[1] -0.003408017
mean(norm_means50$x_bar)
[1] -0.0009473133
mean(norm_means100$x_bar)
[1] 0.001145927
sd(norm_means10$x_bar)
[1] 0.3131297
sd(norm_means50$x_bar)
[1] 0.1416664
sd(norm_means100$x_bar)
[1] 0.100464
var(norm_means10$x_bar)
[1] 0.09805018
var(norm_means50$x_bar)
[1] 0.02006937
var(norm_means100$x_bar)
[1] 0.01009301
  • How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution?
    • \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
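  • One way to answer: compute the theoretical standard errors \frac{\sigma}{\sqrt{n}} (here \sigma = 1) and set them beside the simulated ones; a minimal sketch:
1 / sqrt(c(10, 50, 100))  # theoretical SEs: 0.3162, 0.1414, 0.1000
# simulated SEs above:                       0.3131, 0.1417, 0.1004
# and the simulated means all sit near the population mean, mu = 0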

3.3 Bernoulli population: population \sim Bernoulli(0.3)

3.3.1 Create a population

  • rbinom() creates a random variable that follows a binomial distribution
    • First argument = Number of observations (population size)
    • Second argument = number of trials (here, just 1 for a Bernoulli trial)
    • Third argument = p (probability of success)
bernoulli_pop <- rbinom(1000000, 1, 0.3)
mean(bernoulli_pop)
[1] 0.300871
sd(bernoulli_pop)
[1] 0.4586369
var(bernoulli_pop)
[1] 0.2103479
  • Copy and modify the code above to resample from this new Bernoulli population
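  • Quick check before you do: a Bernoulli(p) population has mean p and variance p(1 - p), so the estimates above should be close to these theoretical values (a minimal sketch)
p <- 0.3
p * (1 - p)        # theoretical variance: 0.21
sqrt(p * (1 - p))  # theoretical SD: ~0.4583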

3.3.2 Sample from the population

  • Small sample: n = 10
bern_means10 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Moderate sample: n = 50
bern_means50 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Large sample: n = 100
bern_means100 <- data.frame(bernoulli_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bernoulli_pop))

ggplot(data = bern_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.3.3 Compare estimates

  • Compare means, standard deviations, and variances for each sample size
mean(bern_means10$x_bar)
[1] 0.29996
mean(bern_means50$x_bar)
[1] 0.300286
mean(bern_means100$x_bar)
[1] 0.301151
sd(bern_means10$x_bar)
[1] 0.1472554
sd(bern_means50$x_bar)
[1] 0.06461003
sd(bern_means100$x_bar)
[1] 0.04610139
var(bern_means10$x_bar)
[1] 0.02168417
var(bern_means50$x_bar)
[1] 0.004174456
var(bern_means100$x_bar)
[1] 0.002125338
  • How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution for a normally distributed population?
    • \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
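  • One way to answer: plug \sigma^2 = p(1 - p) into the formula above, so the standard error is \sqrt{\frac{p(1 - p)}{n}}; a minimal sketch:
sqrt(0.3 * 0.7 / c(10, 50, 100))  # theoretical SEs: 0.1449, 0.0648, 0.0458
# simulated SEs above:                               0.1473, 0.0646, 0.0461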

3.4 Poisson population: population \sim Poisson(1.5)

3.4.1 Create a population

  • rpois() creates a random variable that follows a Poisson distribution
    • First argument = Number of observations (population size)
    • Second argument = \lambda (mean and variance)
      • Remember that \lambda is the variance; we’ll also look at the standard deviation, which is its square root: \sqrt{\lambda}
poisson_pop <- rpois(1000000, 1.5)
mean(poisson_pop)
[1] 1.501647
sd(poisson_pop)
[1] 1.226078
var(poisson_pop)
[1] 1.503267
  • Copy and modify the code above to resample from this new Poisson population
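  • Quick check before you do: a Poisson(\lambda) population has mean and variance both equal to \lambda, so its standard deviation is \sqrt{\lambda} (a minimal sketch; compare to the estimates above)
lambda <- 1.5
sqrt(lambda)  # theoretical SD: ~1.2247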

3.4.2 Sample from the population

  • Small sample: n = 10
pois_means10 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = pois_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 4)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Moderate sample: n = 50
pois_means50 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = pois_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 4)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

  • Large sample: n = 100
pois_means100 <- data.frame(poisson_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(poisson_pop))

ggplot(data = pois_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(0, 4)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3.4.3 Compare estimates

  • Compare means, standard deviations, and variances for each sample size
mean(pois_means10$x_bar)
[1] 1.49989
mean(pois_means50$x_bar)
[1] 1.502572
mean(pois_means100$x_bar)
[1] 1.503961
sd(pois_means10$x_bar)
[1] 0.3863003
sd(pois_means50$x_bar)
[1] 0.1733869
sd(pois_means100$x_bar)
[1] 0.1214733
var(pois_means10$x_bar)
[1] 0.1492279
var(pois_means50$x_bar)
[1] 0.03006303
var(pois_means100$x_bar)
[1] 0.01475576
  • How does each value correspond to what you’d expect based on the mathematically-derived sampling distribution for a normally distributed population?
    • \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})
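  • One way to answer: plug \sigma^2 = \lambda into the formula above, so the standard error is \sqrt{\frac{\lambda}{n}}; a minimal sketch:
sqrt(1.5 / c(10, 50, 100))  # theoretical SEs: 0.3873, 0.1732, 0.1225
# simulated SEs above:                         0.3863, 0.1734, 0.1215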

4 Summarize results

  • In general, how does sample size impact the standard error?
    • i.e., the standard deviation of the sampling distribution?
  • Does it seem to matter what the population distribution is?
    • Based on: Visual inspection of the distribution
    • Based on: Estimates of standard error
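  • One way to organize your answer (a sketch, not required): tabulate the theoretical standard errors for all three populations and compare them to the simulated values above
n <- c(10, 50, 100)
data.frame(
  n         = n,
  normal    = 1 / sqrt(n),          # sigma = 1
  bernoulli = sqrt(0.3 * 0.7 / n),  # sigma = sqrt(p * (1 - p))
  poisson   = sqrt(1.5 / n)         # sigma = sqrt(lambda)
)
  • Whatever the population’s shape, the standard error shrinks in proportion to \frac{1}{\sqrt{n}}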