Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

Describe the logic of hypothesis testing
Interpret tests comparing one sample to hypothesized value
Relate hypothesis testing to confidence intervals
Recognize when to use a nonparametric test

2 Hypothesis testing

2.1 History of hypothesis testing

William Gossett (Student)
- \(t\) distribution, \(t\)-test
Sir Ronald Fisher
- “Significance testing”
- \(p\)-values, \(\alpha\) = .05, \(\beta\) = .2, degrees of freedom
Jerzy Neyman, Karl Pearson, Egon Pearson
- “Hypothesis testing”

2.2 Null and alternative hypotheses

Null hypothesis = no effect in population
- \(H_0\)
Alternative hypothesis = effect in population
- \(H_1\) or \(H_A\)
If there is enough evidence, we can reject \(H_0\)
- Otherwise, we retain or accept \(H_0\)
- We cannot reject \(H_A\)

2.3 Directional vs non-directional tests

Directional (one-tailed) tests
- \(H_0\): \(\mu \le \mu_0\)
- \(H_1\): \(\mu > \mu_0\)
Non-directional (two-tailed) tests
- \(H_0\): \(\mu = \mu_0\)
- \(H_1\): \(\mu \ne \mu_0\)

2.4 95% One tailed vs two tailed

One-tailed test (\(H_1: \mu > 0\))

Code

ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "line",
                xlim = c(-4, 4)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                xlim = c(-4, 1.65)) +
  annotate("text", x = 0, y = 0.1, label = "95%", color = "white", size = 6) +
  annotate("text", x = 3, y = 0.1, label = "5%", size = 6) +
  geom_vline(xintercept = 1.65, linetype = "dashed", linewidth = 1) +
  annotate("text", x = 2.25, y = 0.35, label = "z=1.65", size = 6) +
  xlim(-4, 4) +
  labs(x = "z", y = "f(z)")

Two-tailed test (\(H_1: \mu \ne 0\))

Code

ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "line",
                xlim = c(-4, 4)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                xlim = c(-1.96, 1.96)) +
  annotate("text", x = 0, y = 0.1, label = "95%", color = "white", size = 6) +
  annotate("text", x = 3, y = 0.1, label = "2.5%", size = 6) +
  annotate("text", x = -3, y = 0.1, label = "2.5%", size = 6) +
  geom_vline(xintercept = 1.96, linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = -1.96, linetype = "dashed", linewidth = 1) +
  annotate("text", x = 2.6, y = 0.35, label = "z=+1.96", size = 6) +
  annotate("text", x = -2.6, y = 0.35, label = "z=-1.96", size = 6) +
  xlim(-4, 4) +
  labs(x = "z", y = "f(z)")

2.5 Critical values and decisions

Critical value: Distribution value that puts \(\alpha\) in the tail(s)
- i.e., \(\pm\) 1.96 for a two-tailed \(z\)-test, 1.65 for a one-tailed \(z\)-test
If the test statistic is more extreme than the critical value
- Reject \(H_0\)
Otherwise, retain \(H_0\) or accept \(H_0\)

2.6 Rejection region(s)

One-tailed test (\(H_1: \mu > 0\))

Code

ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "line",
                xlim = c(-4, 4)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                xlim = c(-4, 1.65)) +
  annotate("text", x = 0, y = 0.1, label = "95%", color = "white", size = 6) +
  annotate("text", x = 3, y = 0.1, label = "5%", size = 6) +
  geom_vline(xintercept = 1.65, linetype = "dashed", linewidth = 1) +
  annotate("text", x = 2.25, y = 0.35, label = "z=1.65", size = 6) +
  annotate("text", x = 3, y = 0.25, label = "Rejection \nregion", size = 6) +
  xlim(-4, 4) +
  labs(x = "z", y = "f(z)")

Two-tailed test (\(H_1: \mu \ne 0\))

Code

ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm,
                geom = "line",
                xlim = c(-4, 4)) +
  stat_function(fun = dnorm,
                geom = "area",
                fill = "steelblue",
                xlim = c(-1.96, 1.96)) +
  annotate("text", x = 0, y = 0.1, label = "95%", color = "white", size = 6) +
  annotate("text", x = 3, y = 0.1, label = "2.5%", size = 6) +
  annotate("text", x = -3, y = 0.1, label = "2.5%", size = 6) +
  geom_vline(xintercept = 1.96, linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = -1.96, linetype = "dashed", linewidth = 1) +
  annotate("text", x = 2.6, y = 0.35, label = "z=+1.96", size = 6) +
  annotate("text", x = -2.6, y = 0.35, label = "z=-1.96", size = 6) +
  annotate("text", x = 3, y = 0.25, label = "Rejection \nregion", size = 6) +
  annotate("text", x = -3, y = 0.25, label = "Rejection \nregion", size = 6) +
  xlim(-4, 4) +
  labs(x = "z", y = "f(z)")

2.7 Statistical errors

Any statistical decision has some probability of being an error
- Type I error: Detecting an effect in the sample that doesn’t exist in the population
- Type II error: Not detecting an effect in the sample that does exist in the population

	Effect in population	No effect in population
Effect in sample	Correct	Type I error (\(\alpha\))
No effect in sample	Type II error (\(\beta\))	Correct

2.8 Alpha (\(\alpha\))

\(\alpha\) = Probability of type I error
- P(incorrectly detecting an effect that doesn’t exist in the population)
Typical \(\alpha\) values: .05, .01, .001
- Correspond to 95%, 99%, 99.9% confidence levels, respectively
Fisher: Type I error should happen 1 time in 20 \(\rightarrow\) \(\alpha\) = .05

2.9 Beta (\(\beta\))

\(\beta\) = Probability of type II error
- P(incorrectly missing an effect that exists in the population)
Statistical power = \(1 - \beta\)
- Probability of correctly detecting an effect that does exist
Typical \(\beta\) values: .2, .1
- Correspond to .8, .9 power, respectively
Fisher: Type II:Type I ratio = 4 \(\rightarrow\) 0.2:0.05 = 4

3 One sample tests

3.1 One sample tests

Compare one sample mean to a hypothesized population mean value
Reject \(H_0\)
- Sample is unlikely to have come from a population with that mean
Retain \(H_0\)
- Sample could have come from a population with that mean
Three tests here: \(z\)-test, \(t\)-test, binomial test

3.2 \(z\)-test: Assumptions

Data are continuous (i.e., ratio or interval)
Data are randomly sampled from the population
Data are independent
Data are approximately normally distributed OR sample size is large enough for normally distributed sampling distribution (central limit theorem)
Population variance (or standard deviation) is known

3.3 \(z\)-test: Hypotheses

Directional (one-tailed) tests
- \(H_0\): \(\mu \le \mu_0\)
- \(H_1\): \(\mu > \mu_0\)
Non-directional (two-tailed) tests
- \(H_0\): \(\mu = \mu_0\)
- \(H_1\): \(\mu \ne \mu_0\)

3.4 \(z\)-test: Test statistic

\[z = \frac{\bar{X} - \mu_0}{\sigma /\sqrt{n}}\]

Sample mean minus hypothesized population mean, divided by standard error
- Standard error is the standard deviation of the sampling distribution

3.5 \(z\)-test: Decision

Determine critical value for \(\alpha\)
- e.g., 1.96 for 2-tailed test at \(\alpha = .05\)
If observed \(z\) statistic is more extreme than the critical value
- Reject \(H_0\)
Otherwise, retain \(H_0\)

3.6 \(z\)-test: Example 1

Sample of \(n\) = 50
Known population SD = 15
Does this sample come from a population with a mean of 100?
Two-tailed test
- \(H_0\): \(\mu = 100\)
- \(H_1\): \(\mu \ne 100\)

Code

sample1 <- rnorm(n = 50, mean = 103, sd = 15)

3.7 \(z\)-test: Example 2

\[z = \frac{\bar{X} - \mu_0}{\sigma /\sqrt{n}}\]

mean(sample1)

[1] 105.6935

z <- (mean(sample1) - 100) / (15 / sqrt(50)) 
z

[1] 2.683939

Two tailed test, \(\alpha\) = .05: Critical value = 1.96
- Is the observed test statistic greater than +1.96 or lower than -1.96?
- If yes, reject \(H_0\)

3.8 \(z\)-test: Example 3

Another option: z.test() function from BSDA (Basic Statistics and Data Analysis) package
- x: Data
- alternative: “two.tailed” (default), “greater”, or “less”
- mu: Hypothesized population mean (\(\mu_0\))
- sigma.x: Known population standard deviation
- conf.level: Confidence level (default = .95)

3.9 \(z\)-test: Example 4

library(BSDA)
z.test(x = sample1, 
       alternative = "two.sided",
       mu = 100, 
       sigma.x = 15,
       conf.level = .95)


    One-sample z-Test

data:  sample1
z = 2.6839, p-value = 0.007276
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
 101.5358 109.8512
sample estimates:
mean of x 
 105.6935

3.10 \(z\)-test: Report results

\(p\)-value < .05 and 95% confidence interval don’t contain 100
- Reject \(H_0: \mu = 100\)
- This sample came from a population with a different mean
Using a one sample \(z\)-test, we rejected the null hypothesis that \(\mu\) = 100, \(z\) = 2.684, \(p\) = 0.007.

3.11 What is the \(p\)-value?

P(detecting an effect this large or larger if the null hypothesis is true)

Code

mu0 <- paste0("mu[0]")
ggplot(data.frame(x = c(90, 110)), aes(x)) +
  stat_function(fun = dnorm,
                args = list(mean = 100, sd = 15/sqrt(50)),
                geom = "line",
                xlim = c(90, 110)) +
  stat_function(fun = dnorm,
                args = list(mean = 100, sd = 15/sqrt(50)),
                geom = "area",
                fill = "steelblue",
                xlim = c(100 - 1.96*15/sqrt(50), 100 + 1.96*15/sqrt(50))) +
  stat_function(fun = dnorm,
                args = list(mean = 100, sd = 15/sqrt(50)),
                geom = "area",
                fill = "red",
                xlim = c(mean(sample1), 110)) +
  geom_vline(xintercept = mean(sample1), linewidth = 1) +
  annotate("text", x = mean(sample1) - 2, y = 0.15, label = "Observed \nmean", size = 6) +
  annotate("text", x = mean(sample1) + 2, y = 0.02, label = "p-value", size = 6, color = "red") +
  geom_vline(xintercept = 100, linewidth = 1, color = "white", linetype = "dashed") +
  annotate("text", x = 99, y = 0.05, label = mu0, size = 6, color = "white", parse = TRUE) +
  xlim(90, 110) +
  labs(x = "X", y = "f(x)")

3.12 \(t\)-test: Assumptions

Data are continuous (i.e., ratio or interval)
Data are randomly sampled from the population
Data are independent
Data are approximately normally distributed OR sample size is large enough for normally distributed sampling distribution (central limit theorem)
Population variance (or standard deviation) is unknown

3.13 Unknown population variance

Population variance is not known when using a \(t\)-test
- So we estimate it with the sample variance

Estimate	Definition	Computational
Population variance (\(\sigma^2\))	\(\frac{\sum (X - \mu)^2}{N}\)	\(\frac{\sum X^2 - \frac{(\sum X)^2}{N}}{N}\)
Sample variance (\(s^2\))	\(\frac{\sum (X - \overline{X})^2}{n-1}\)	\(\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n-1}\)

3.14 \(t\)-test: Hypotheses

Directional (one-tailed) tests
- \(H_0\): \(\mu \le \mu_0\)
- \(H_1\): \(\mu > \mu_0\)
Non-directional (two-tailed) tests
- \(H_0\): \(\mu = \mu_0\)
- \(H_1\): \(\mu \ne \mu_0\)

3.15 \(t\)-test: Test statistic

\[t = \frac{\bar{X} - \mu_0}{s /\sqrt{n}}\]

Sample mean minus hypothesized population mean, divided by standard error
- Standard error is the standard deviation of the sampling distribution

3.16 Degrees of freedom

How many independent pieces of information do we have?
- In general, this is the sample size (\(n\))
But if we estimate something (e.g., \(\bar{X}\)) and then use that to estimate something else (e.g., \(s^2\)), we have less information
- Once you know \(n - 1\) values and the mean, you know everything
Degrees of freedom quantify how much independent information

3.17 Degrees of freedom: \(t\) distribution

Code

ggplot(data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), geom = "line", color = "black", linewidth = 1) +
  stat_function(fun = dt, args = list(df = 1, ncp = 0), geom = "line", color = "red", linewidth = 1, linetype = "dashed") +
  stat_function(fun = dt, args = list(df = 30, ncp = 0), geom = "line", color = "darkgreen", linewidth = 1, linetype = "dashed") +
  ylim(0,.5) + 
  scale_x_continuous(breaks = -3:3) +
  annotate("text", x = 0, y = 0.29, label = "t(1)", color = "red", size = 6) +
  annotate("text", x = 0, y = 0.37, label = "t(30)", color = "darkgreen", size = 6) +
  annotate("text", x = 0, y = 0.43, label = "z", color = "black", size = 6) +
  labs(x = "t", y = "f(t)")

3.18 \(t\)-test: Decision

Determine critical value for \(\alpha\) and degrees of freedom
- Degrees of freedom = \(n - 1\)
If observed \(t\) statistic is more extreme than the critical value
- Reject \(H_0\)
Otherwise, retain \(H_0\)

3.19 \(t\)-test: Example 1

Same data as for \(z\)-test
Sample of \(n\) = 50
Unknown population SD
Does this sample come from a population with a mean of 100?
Two-tailed test
- \(H_0\): \(\mu = 100\)
- \(H_1\): \(\mu \ne 100\)

3.20 \(t\)-test: Example 2

\[t = \frac{\bar{X} - \mu_0}{s /\sqrt{n}}\]

mean(sample1)

[1] 105.6935

t <- (mean(sample1) - 100) / (sd(sample1) / sqrt(50)) 
t

[1] 2.447541

Two tailed test, \(\alpha\) = .05, df = 49: Critical value = 2.01
- Is the observed test statistic greater than +2.01 or lower than -2.01?
- If yes, reject \(H_0\)

3.21 \(t\)-test: Example 3

Another option: t.test() function in stats package
- x: Data
- alternative: “two.sided” (default), “greater”, or “less”
- mu: Hypothesized population mean (\(\mu_0\))
- conf.level: Confidence level (default = .95)

3.22 \(t\)-test: Example 4

t.test(x = sample1,
       alternative = "two.sided",
       mu = 100, 
       conf.level = .95)


    One Sample t-test

data:  sample1
t = 2.4475, df = 49, p-value = 0.01801
alternative hypothesis: true mean is not equal to 100
95 percent confidence interval:
 101.0188 110.3682
sample estimates:
mean of x 
 105.6935

3.23 \(t\)-test: Report results

\(p\)-value < .05 and 95% confidence interval don’t contain 100
- Reject \(H_0: \mu = 100\)
- This sample came from a population with a different mean
Using a one sample \(t\)-test, we rejected the null hypothesis that \(\mu\) = 100, \(t(49)\) = 2.448, \(p\) = 0.018

3.24 Compare \(z\)-test and \(t\)-test

	\(z\)-test	\(t\)-test
Sample mean	105.69	105.69
Population SD	15 (known)	16.45 (estimated)
Test statistic	2.684	2.448
Degrees of freedom	N/A	49
Critical value	1.96	2.01
\(p\)-value	0.007	0.018
Decision	Reject \(H_0\)	Reject \(H_0\)

3.25 \(\mathcal{N}(0, 1)\) vs \(t(49)\)

Code

ggplot(data.frame(x = c(-3, 3)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 0, sd = 1), geom = "line", color = "black", linewidth = 1) +
  #stat_function(fun = dt, args = list(df = 1, ncp = 0), geom = "line", color = "red", linewidth = 1, linetype = "dashed") +
  #stat_function(fun = dt, args = list(df = 10, ncp = 0), geom = "line", color = "darkgreen", linewidth = 1, linetype = "dashed") +
  stat_function(fun = dt, args = list(df = 49, ncp = 0), geom = "line", color = "red", linewidth = 1, linetype = "dashed") +
  ylim(0,.5) + 
  scale_x_continuous(breaks = -3:3) +
  labs(x = "", y = "")

3.26 Binomial test: Assumptions

Data are binomial (i.e., several 0,1 Bernoulli trials)
- With same probability of success (\(p\)) for each trial
Data are randomly sampled from the population
Data are independent

3.27 Binomial test: Hypotheses

Directional (one-tailed) tests
- \(H_0\): \(\pi \le \pi_0\)
- \(H_1\): \(\pi > \pi_0\)
Non-directional (two-tailed) tests
- \(H_0\): \(\pi = \pi_0\)
- \(H_1\): \(\pi \ne \pi_0\)

3.28 Binomial test: Test and decision

\(P(X = x) = {m \choose x} p^x (1 - p)^{m -x}\)

Binomial test uses the binomial distribution
- Exact test
- Does not require “large” sample
- Compare observed proportion (i.e., number of successes) to binomial distribution with hypothesized \(m\) and \(p\)

3.29 Binomial test: Example 1

Sample of \(n\) = 50 coin flips
Does this sample come from a fair coin (\(\pi\) = 0.5)?
Two-tailed test
- \(H_0\): \(\pi = 0.5\)
- \(H_1\): \(\pi \ne 0.5\)

Code

sample2 <- rbinom(50, 1, 0.53)
table(sample2)

sample2
 0  1 
30 20

3.30 Binomial test: Example 2

\(P(X = x) = {m \choose x} p^x (1 - p)^{m - x}\)

Code

binom_dat <- data.frame(x = 0:50, y = dbinom(0:50, 50, 0.5))
ggplot(data = binom_dat, aes(x = x, y = y)) +
  geom_col() +
  #scale_x_continuous(breaks = 0:50) +
  ylim(0,0.15) +
  labs(x = "X", y = "P(X = x)") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

3.31 Binomial test: Example 3

binom.test() function from stats package
- x: number of successes
- n: sample size
- p: hypothesized probability of success (median = 50th %ile)
- alternative: “two.sided” (default), “less”, “greater”

3.32 Binomial test: Example 4

binom.test(x = sum(sample2 > 0), 
           n = length(sample2), 
           p = 0.5, 
           alternative = "two.sided")


    Exact binomial test

data:  sum(sample2 > 0) and length(sample2)
number of successes = 20, number of trials = 50, p-value = 0.2026
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.2640784 0.5482060
sample estimates:
probability of success 
                   0.4

3.33 Binomial test: Report results

\(p\)-value > .05 and 95% confidence interval contain 0.5
- Retain \(H_0: \pi = 0.5\)
- This sample came from a population with mean = 0.5
Using a binomial test, we retained the null hypothesis that \(\pi\) = 0.5, observed mean = 0.4, \(p\) = 0.2026.

3.34 Alternative: \(z\)-test for proportion

Proportion = mean of binary (i.e., Bernoulli) variable
- \(z\)-test as test of proportion
Observed distribution is not normal
- Sampling distribution is normal with a “large” sample
- Approximate test based on assumption of normality
Assumptions, hypotheses, critical values are the same as \(z\)-test
- Replace \(\mu\) with \(\pi\) and \(\bar{X}\) with \(p\)

3.35 Proportion: \(z\)-test statistic

\[z = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}\]

If \(H_0\) is true
- Population mean = \(\pi_0\)
- Population variance = \(\pi_0(1-\pi_0)\)
Works best when \(p\) and \(\pi\) are between 0.2 and 0.8

4 Non-parametric tests

4.1 Parametric vs non-parametric tests

Parametric = following a distribution
- e.g., sampling distribution of the mean is normally distributed
Non-parametric = not following a distribution
- Don’t assume any particular sampling distribution
- Often assume the variables don’t have a particular distribution

4.2 Non-parametric tests

Many non-parametric tests
- Not well organized or named
Two commonly used one sample non-parametric tests
- Sign test: Test of median for ordinal+ data
- Wilcoxon signed rank test: Test of median for interval+ data

4.3 Why non-parametric tests?

Non-numeric data (i.e., nominal or ordinal)
Continuous but very non-normal variables
Especially with smaller samples, where CLT doesn’t apply
Note that parametric tests are more powerful than non-parametric tests if their assumptions are met

4.4 Sign test: Assumptions

Data are at least ordinal (ordinal, interval, or ratio)
Data are randomly sampled from the population
No assumptions about distribution of the data
No assumptions about sampling distribution of the median
Hypotheses are about median

4.5 Sign test: Logic

Count observations above and below hypothesized median
- If observed median = population median: 50% above, 50% below
Compare to binomial distribution to determine probability of observed number of “successes”

4.6 Sign test: Example

binom.test() function from stats package
- x: number of successes (here, values > hypothesized median)
- n: sample size
- p: hypothesized probability of success (median = 50th %ile)
- alternative: “two.sided” (default), “less”, “greater”

4.7 Sign test: Example

binom.test(x = sum(sample1 > 100), 
           n = length(sample1), 
           p = 0.5, 
           alternative = "two.sided")


    Exact binomial test

data:  sum(sample1 > 100) and length(sample1)
number of successes = 32, number of trials = 50, p-value = 0.06491
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.4919314 0.7708429
sample estimates:
probability of success 
                  0.64

4.8 Wilcoxon Signed rank test: Assumptions

Data are at least interval (interval or ratio)
Data are randomly sampled from the population
No assumptions about distribution of the data
No assumptions about sampling distribution of the median
Hypotheses are about median

4.9 Signed rank test: Logic

Rank absolute values of all values
Above hypothesized median: Assign “+”
Below hypothesized median: Assign “-”
Add positive ranks, add negative ranks
Compare to distribution under the null hypothesis
- i.e., sum of positive ranks = sum of negative ranks

4.10 Signed rank test: Example

wilcox.test() function from stats package

median(sample1)

[1] 109.4636

wilcox.test(sample1, mu = 100, alternative = "two.sided")


    Wilcoxon signed rank test with continuity correction

data:  sample1
V = 877, p-value = 0.02105
alternative hypothesis: true location is not equal to 100

5 Summary

5.1 Hypothesis testing

Null and alternative hypotheses
- Retain or reject \(H_0\)
Hypotheses are about the population value (mean, median, etc.)
- We use the sample to make inference about the population
If we reject \(H_0\), that just means that it’s very unlikely that the sample came from a population with that mean
- We can still make a type I error

5.2 Deciding on a test

Normal outcome, known \(\sigma^2\): \(z\)-test
Normal outcome, unknown \(\sigma^2\): \(t\)-test
Proportion: Binomial test for small sample, \(z\)-test for large sample
Ordinal or very non-normal, especially with small sample: Non-parametric test

6 In-class activities

6.1 In-class activities

Assess some hypotheses about some variables
- What type of variable?
- What type of test?
- What do we conclude?

6.2 Next week

Tests for two unrelated samples
Tests for two related samples
Chi-square test of independence for two binary variables