Introduction to Biostatistics

1 Learning objectives

1.1 Learning objectives

  • Interpret tests comparing two unrelated samples
  • Summarize data using contingency tables
  • Describe different study designs for contingency tables

2 Two independent samples

2.1 Independent samples tests

  • Compare means (or medians) from two unrelated samples
  • Do the two samples come from populations with the same mean?
    • Is the difference between the two population means 0?

2.2 Independent samples tests

  • Parametric tests
    • Independent samples \(z\)-test
    • Independent samples \(t\)-test (also Welch’s \(t\)-test)
  • Non-parametric tests
    • Median test
    • Wilcoxon-Mann-Whitney U test
    • Chi-square test and Fisher’s exact test (next week)

2.3 Independence

  • Statistical independence
    • Two events are independent if the occurrence of one event does not change the probability of the other event
    • Knowing about one event tells you nothing about the other event
  • Independent samples
    • Individuals in one sample are not related to individuals in the other sample
    • The two samples are made up of different individuals

2.4 \(z\)-test: Assumptions

  • Data are continuous (i.e., ratio or interval)
  • Data are randomly sampled from the population(s)
  • Data are independent (within and across groups)
  • Data are approximately normally distributed OR the sample size is large enough for the sampling distribution of the mean to be approximately normal (central limit theorem)
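  • Population variance (or SD) is known in each group (the test statistic uses \(\sigma^2_1\) and \(\sigma^2_2\) separately, so they need not be equal)
    • “Large sample”: each population variance is estimated by the corresponding sample variance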

2.5 \(z\)-test: Hypotheses

  • Directional (one-tailed) tests
    • \(H_0\): \(\mu_1 \le \mu_2\) or \(\mu_1 - \mu_2 \le 0\)
      • \(H_1\): \(\mu_1 > \mu_2\) or \(\mu_1 - \mu_2 > 0\)
    • \(H_0\): \(\mu_1 \ge \mu_2\) or \(\mu_1 - \mu_2 \ge 0\)
      • \(H_1\): \(\mu_1 < \mu_2\) or \(\mu_1 - \mu_2 < 0\)
  • Non-directional (two-tailed) tests
    • \(H_0\): \(\mu_1 = \mu_2\) or \(\mu_1 - \mu_2 = 0\)
      • \(H_1\): \(\mu_1 \ne \mu_2\) or \(\mu_1 - \mu_2 \ne 0\)

2.6 \(z\)-test: Test statistic

\[z = \frac{(\bar{X_1} - \bar{X_2}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma^2_1}{n_1} + \frac{\sigma^2_2}{n_2}}}\]

  • \(\mu_1 - \mu_2 = 0\) (according to \(H_0\))
  • \(\sigma^2_1\) and \(\sigma^2_2\) are the variances of the two groups
  • \(n_1\) and \(n_2\) are the sample sizes of the two groups

2.7 \(z\)-test: Example 1

  • Is resting pulse rate the same for smokers and non-smokers?
    • One dataset / variable (column) for each group
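library(Stat2Data)   # provides the Pulse dataset
library(tidyverse)   # filter(), %>%, ggplot()
data(Pulse)
# One data frame per group (Smoke: 1 = smoker, 0 = non-smoker)
Pulse_smoke <- Pulse %>% filter(Smoke == 1)
Pulse_nosmoke <- Pulse %>% filter(Smoke == 0)
head(Pulse_smoke)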
  Active Rest Smoke Sex Exercise Hgt Wgt
1     82   68     1   0        3  70 225
2     86   68     1   0        2  73 195
3     87   72     1   0        2  70 173
4    102   77     1   0        2  72 200
5     80   67     1   1        2  65 133
6     99   78     1   0        3  71 165
head(Pulse_nosmoke)
  Active Rest Smoke Sex Exercise Hgt Wgt
1     97   78     0   1        1  63 119
2     88   62     0   0        3  72 175
3    106   74     0   0        3  72 170
4     78   63     0   1        3  67 125
5    109   65     0   0        3  74 188
6     66   43     0   1        3  67 140

2.8 \(z\)-test: Example 2

  • Is resting pulse rate the same for smokers and non-smokers?
    • Check out the data
Group          n    Mean     SD
Non-smokers  206  67.791  9.851
Smokers       26  72.769  9.799

2.9 \(z\)-test: Example 3

  • Is resting pulse rate the same for smokers and non-smokers?
    • Check out the data
ggplot(data = Pulse_nosmoke, 
       aes(x = Rest)) +
  geom_histogram(fill = "red", 
                 alpha = 0.5, 
                 bins = 30) +
  geom_histogram(data = Pulse_smoke, 
                 aes(x = Rest), 
                 fill = "black", 
                 bins = 30)

2.10 \(z\)-test: Example 4

  • Is resting pulse rate the same for smokers and non-smokers?
    • z.test() function from BSDA package
library(BSDA)
ztest <- z.test(x = Pulse_smoke$Rest, 
                y = Pulse_nosmoke$Rest, 
                sigma.x = sd(Pulse_smoke$Rest), 
                sigma.y = sd(Pulse_nosmoke$Rest), 
                alternative = "two.sided")
ztest

    Two-sample z-Test

data:  Pulse_smoke$Rest and Pulse_nosmoke$Rest
z = 2.4394, p-value = 0.01471
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.9783252 8.9776121
sample estimates:
mean of x mean of y 
 72.76923  67.79126 
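
To connect this output to the test statistic formula, the following sketch recomputes \(z\), the two-sided \(p\)-value, and the 95% confidence interval by hand from the rounded summary statistics shown earlier. It uses base R only; the variable names are illustrative and, as in the z.test() call, the sample SDs stand in for the population SDs.

```r
# Summary statistics from the slides (rounded to 3 decimals)
n1 <- 26;  m1 <- 72.769; s1 <- 9.799   # smokers
n2 <- 206; m2 <- 67.791; s2 <- 9.851   # non-smokers

# Standard error of the difference in means (sample SDs used in place of sigma)
se <- sqrt(s1^2 / n1 + s2^2 / n2)

# z statistic under H0: mu1 - mu2 = 0
z <- ((m1 - m2) - 0) / se

# Two-sided p-value and 95% confidence interval
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)
ci <- (m1 - m2) + c(-1, 1) * qnorm(0.975) * se

round(c(z = z, p = p_two_sided, lower = ci[1], upper = ci[2]), 3)
```

Because the inputs are rounded, the results should agree with the z.test() output only to about three decimal places. For a directional test with \(H_1\): \(\mu_1 > \mu_2\), the one-sided \(p\)-value would be pnorm(z, lower.tail = FALSE).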

2.11 \(z\)-test: Report results

  • Reject \(H_0\): \(\mu_1 = \mu_2\) or \(\mu_1 - \mu_2 = 0\)
    • \(p\)-value < .05
    • 95% confidence interval doesn’t contain 0 (value from \(H_0\))
    • Evidence that the two samples came from populations with different means
  • Using a two-sample \(z\)-test, we rejected the null hypothesis that the means are equal, \(z\) = 2.44, \(p\) = .01
    • Smokers and non-smokers have different resting pulse rates

2.12 \(t\)-test: Assumptions

  • Data are continuous (i.e., ratio or interval)
  • Data are randomly sampled from the population(s)
  • Data are independent (within and across groups)
  • Data are approximately normally distributed OR the sample size is large enough for the sampling distribution of the mean to be approximately normal (central limit theorem)
  • Population variance (or SD) is unknown but assumed equal in both groups

2.13 \(t\)-test: Hypotheses

  • Directional (one-tailed) tests
    • \(H_0\): \(\mu_1 \le \mu_2\) or \(\mu_1 - \mu_2 \le 0\)
      • \(H_1\): \(\mu_1 > \mu_2\) or \(\mu_1 - \mu_2 > 0\)
    • \(H_0\): \(\mu_1 \ge \mu_2\) or \(\mu_1 - \mu_2 \ge 0\)
      • \(H_1\): \(\mu_1 < \mu_2\) or \(\mu_1 - \mu_2 < 0\)
  • Non-directional (two-tailed) tests
    • \(H_0\): \(\mu_1 = \mu_2\) or \(\mu_1 - \mu_2 = 0\)
      • \(H_1\): \(\mu_1 \ne \mu_2\) or \(\mu_1 - \mu_2 \ne 0\)

2.14 \(t\)-test: Test statistic

\[t = \frac{(\bar{X_1} - \bar{X_2}) - (\mu_1 - \mu_2)}{\sqrt{\frac{s^2_p}{n_1} + \frac{s^2_p}{n_2}}}\]

  • where \(s^2_p = \frac{(n_1 -1)s^2_1 + (n_2 -1)s^2_2}{n_1 + n_2 -2}\)
    • Assumes equal variances and pools (combines) the two sample variances into a single estimate
  • with degrees of freedom = \(n_1 + n_2 - 2\)

2.15 \(t\)-test: Example 1

  • Is resting pulse rate the same for smokers and non-smokers?
    • t.test() function from stats package
ttest <- t.test(x = Pulse_smoke$Rest, 
                y = Pulse_nosmoke$Rest, 
                alternative = "two.sided",
                var.equal = TRUE)
ttest

    Two Sample t-test

data:  Pulse_smoke$Rest and Pulse_nosmoke$Rest
t = 2.4294, df = 230, p-value = 0.01589
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.9405909 9.0153463
sample estimates:
mean of x mean of y 
 72.76923  67.79126 
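
As a check on the pooled-variance formula, this sketch recomputes the \(t\) statistic, degrees of freedom, and two-sided \(p\)-value from the rounded summary statistics shown earlier. It is a by-hand illustration in base R with illustrative variable names, not a replacement for t.test().

```r
# Summary statistics from the slides (rounded to 3 decimals)
n1 <- 26;  m1 <- 72.769; s1 <- 9.799   # smokers
n2 <- 206; m2 <- 67.791; s2 <- 9.851   # non-smokers

# Pooled variance: average of the two sample variances, weighted by df
sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)

# Standard error, degrees of freedom, and t statistic under H0: mu1 - mu2 = 0
se_pooled <- sqrt(sp2 / n1 + sp2 / n2)
df <- n1 + n2 - 2
t_stat <- ((m1 - m2) - 0) / se_pooled

# Two-sided p-value from the t distribution
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

round(c(t = t_stat, df = df, p = p_two_sided), 4)
```

The result should match the t.test() output above up to rounding of the inputs.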

2.16 \(t\)-test: Report results

  • Reject \(H_0\): \(\mu_1 = \mu_2\) or \(\mu_1 - \mu_2 = 0\)
    • \(p\)-value < .05
    • 95% confidence interval doesn’t contain 0 (value from \(H_0\))
    • Evidence that the two samples came from populations with different means
  • Using an independent samples \(t\)-test, we rejected the null hypothesis that the means are equal, \(t(230)\) = 2.43, \(p\) = .02
    • Smokers and non-smokers have different resting pulse rates

2.17 Welch’s \(t\)-test: Unequal variances

\[t = \frac{(\bar{X_1} - \bar{X_2}) - (\mu_1 - \mu_2)}{\sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}}\]

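  • with degrees of freedom from the Welch-Satterthwaite approximation: \(df = \frac{\left(\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}\right)^2}{\frac{(s^2_1/n_1)^2}{n_1 - 1} + \frac{(s^2_2/n_2)^2}{n_2 - 1}}\)
    • May be fractional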

2.18 Welch’s \(t\)-test: Example

welch <- t.test(x = Pulse_smoke$Rest, 
                y = Pulse_nosmoke$Rest, 
                alternative = "two.sided",
                var.equal = FALSE)
welch

    Welch Two Sample t-test

data:  Pulse_smoke$Rest and Pulse_nosmoke$Rest
t = 2.4394, df = 31.721, p-value = 0.02049
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.8198241 9.1361131
sample estimates:
mean of x mean of y 
 72.76923  67.79126 
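
The fractional degrees of freedom in this output can be reproduced with the Welch-Satterthwaite approximation. The sketch below applies it in base R to the rounded sample SDs and sample sizes shown earlier; the variable names are illustrative.

```r
# Sample sizes and SDs from the slides (rounded to 3 decimals)
n1 <- 26;  s1 <- 9.799   # smokers
n2 <- 206; s2 <- 9.851   # non-smokers

# Each group's contribution to the squared standard error
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
df_welch
```

The value lands much closer to \(n_1 - 1 = 25\) than to \(n_1 + n_2 - 2 = 230\) because the smaller group contributes most of the standard error. The Welch statistic itself equals the \(z\) statistic (2.4394), since both use the same unpooled standard error; only the reference distribution differs.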

2.19 Welch’s \(t\)-test: Report results

  • Reject \(H_0\): \(\mu_1 = \mu_2\) or \(\mu_1 - \mu_2 = 0\)
    • \(p\)-value < .05
    • 95% confidence interval doesn’t contain 0 (value from \(H_0\))
    • Evidence that the two samples came from populations with different means
  • Using Welch’s \(t\)-test, we rejected the null hypothesis that the means are equal, \(t(31.72)\) = 2.44, \(p\) = .02
    • Smokers and non-smokers have different resting pulse rates

2.20 Very similar results

Test   Diff   Statistic  df      \(p\)  95% CI
\(z\)  4.978  2.439      NA      0.015  [0.978, 8.978]
\(t\)  4.978  2.429      230     0.016  [0.941, 9.015]
Welch  4.978  2.439      31.721  0.020  [0.820, 9.136]

2.21 Non-parametric tests

  • Non-parametric versions of \(z\) or \(t\)-test for 2 independent samples
    • Median test
    • Wilcoxon-Mann-Whitney U test (or Mann-Whitney or Mann-Whitney-Wilcoxon)
  • Differences in median (technically, “location”)
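  • Mann-Whitney is generally the more powerful test because it uses the ranks of all observations, not just whether each falls above or below the median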

2.22 Non-parametric tests

  • Median test
library(coin)
median_test(Rest ~ as.factor(Smoke), data = Pulse)

    Asymptotic Two-Sample Brown-Mood Median Test

data:  Rest by as.factor(Smoke) (0, 1)
Z = -0.88412, p-value = 0.3766
alternative hypothesis: true mu is not equal to 0
  • Wilcoxon-Mann-Whitney test
wilcox.test(Rest ~ as.factor(Smoke), data = Pulse)

    Wilcoxon rank sum test with continuity correction

data:  Rest by as.factor(Smoke)
W = 1967, p-value = 0.02746
alternative hypothesis: true location shift is not equal to 0
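
To show where the W statistic comes from, here is a minimal sketch on hypothetical toy data (not the Pulse data), chosen to have no ties. It ranks all observations together, sums the ranks in the first group, and subtracts the smallest possible rank sum, which is how wilcox.test() in the stats package reports W.

```r
# Hypothetical toy data, for illustration only
x <- c(1.1, 2.3, 4.5)
y <- c(0.7, 3.2, 5.9, 6.1)

# Rank all observations together, then sum the ranks that belong to x
r <- rank(c(x, y))
rank_sum_x <- sum(r[seq_along(x)])

# W = rank sum of x minus its minimum possible value, n_x (n_x + 1) / 2
W_manual <- rank_sum_x - length(x) * (length(x) + 1) / 2

# Compare with the built-in test
W_builtin <- unname(wilcox.test(x, y)$statistic)
c(manual = W_manual, builtin = W_builtin)
```

W counts the number of (x, y) pairs in which the x value is larger, so values near 0 or near \(n_1 n_2\) suggest a location shift.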

3 Contingency tables

3.1 Contingency tables

  • Cross-tabs, summary tables, 2x2 table
    • Relationship between two (or more) categorical variables
    • Each cell is a frequency for that combination
  • Sex and Smoke from the Pulse dataset
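# Cross-tabulate Sex (rows: 0 = male, 1 = female) by Smoke (columns: 0 = non-smoker, 1 = smoker)
smoke_sex <- table(Pulse$Sex, Pulse$Smoke)
colnames(smoke_sex) <- c("Non-smoker", "Smoker")
rownames(smoke_sex) <- c("Male", "Female")
# Add row and column totals to the counts and to the joint proportions
smoke_sex_margins <- addmargins(smoke_sex)
smoke_sex_prop_margins <- addmargins(prop.table(smoke_sex))
smoke_sex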
        
         Non-smoker Smoker
  Male          105     17
  Female        101      9

3.2 Notation for frequencies

           \(J = 1\)   \(J = 2\)
\(I = 1\)  \(n_{11}\)  \(n_{12}\)
\(I = 2\)  \(n_{21}\)  \(n_{22}\)

3.3 Notation for frequencies

           \(J = 1\)   \(J = 2\)   Total
\(I = 1\)  \(n_{11}\)  \(n_{12}\)  \(n_{1+}\)
\(I = 2\)  \(n_{21}\)  \(n_{22}\)  \(n_{2+}\)
Total      \(n_{+1}\)  \(n_{+2}\)  \(n\)
  • \(n_{11}\), \(n_{12}\), \(n_{21}\), \(n_{22}\) are joint frequencies
  • \(n_{1+}\), \(n_{2+}\), \(n_{+1}\), \(n_{+2}\) are marginal frequencies

3.4 Notation for probabilities

           \(J = 1\)   \(J = 2\)
\(I = 1\)  \(p_{11}\)  \(p_{12}\)
\(I = 2\)  \(p_{21}\)  \(p_{22}\)

3.5 Notation for probabilities

           \(J = 1\)   \(J = 2\)   Total
\(I = 1\)  \(p_{11}\)  \(p_{12}\)  \(p_{1+}\)
\(I = 2\)  \(p_{21}\)  \(p_{22}\)  \(p_{2+}\)
Total      \(p_{+1}\)  \(p_{+2}\)  \(1\)
  • \(p_{11}\), \(p_{12}\), \(p_{21}\), \(p_{22}\) are joint probabilities = \(\frac{n_{ij}}{n}\)
  • \(p_{1+}\), \(p_{2+}\), \(p_{+1}\), \(p_{+2}\) are marginal probabilities = \(\frac{n_{i+}}{n}\) or \(\frac{n_{+j}}{n}\)

3.6 Marginal probability

  • Marginal probability: Probability of \(X\) or \(Y\), collapsing over the other
    • What is the distribution of \(X\), ignoring \(Y\)?
    • What is the distribution of \(Y\), ignoring \(X\)?

3.7 Marginal probability

  • Start with frequencies
                   \(Y = 1\): No smoke             \(Y = 2\): Smoke               Total
\(X = 1\): Male    \(n_{11} = \color{red}{105}\)   \(n_{12} = \color{red}{17}\)   \(n_{1+} = \color{blue}{122}\)
\(X = 2\): Female  \(n_{21} = \color{red}{101}\)   \(n_{22} = \color{red}{9}\)    \(n_{2+} = \color{blue}{110}\)
Total              \(n_{+1} = \color{blue}{206}\)  \(n_{+2} = \color{blue}{26}\)  \(n = \color{blue}{232}\)

3.8 Marginal probability

  • Divide each value by \(n\): The total sample size
                   \(Y = 1\): No smoke              \(Y = 2\): Smoke                 Total
\(X = 1\): Male    \(p_{11} = \color{red}{0.45}\)   \(p_{12} = \color{red}{0.07}\)   \(p_{1+} = \color{blue}{0.53}\)
\(X = 2\): Female  \(p_{21} = \color{red}{0.44}\)   \(p_{22} = \color{red}{0.04}\)   \(p_{2+} = \color{blue}{0.47}\)
Total              \(p_{+1} = \color{blue}{0.89}\)  \(p_{+2} = \color{blue}{0.11}\)  \(p = \color{blue}{1}\)
  • \(\color{red}{Joint}\) probabilities sum to 1
  • \(\color{blue}{Marginal}\) probabilities for rows sum to 1
  • \(\color{blue}{Marginal}\) probabilities for columns sum to 1
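
The joint and marginal probabilities above can be reproduced from the four cell counts. This is a minimal base R sketch that rebuilds the table by hand so it does not depend on the Pulse data being loaded; the object names are illustrative.

```r
# Cell counts from the Sex-by-Smoke table (matrix() fills column by column)
counts <- matrix(c(105, 101, 17, 9), nrow = 2,
                 dimnames = list(Sex = c("Male", "Female"),
                                 Smoke = c("Non-smoker", "Smoker")))

n <- sum(counts)          # total sample size
joint <- counts / n       # joint probabilities: p_ij = n_ij / n
p_row <- rowSums(joint)   # marginal probabilities for Sex: p_i+
p_col <- colSums(joint)   # marginal probabilities for Smoke: p_+j

round(joint, 2)
round(p_row, 2)
round(p_col, 2)
```

Summing joint, p_row, or p_col each gives 1, which matches the three bullet points above.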

3.9 Conditional probability

  • Often, \(X\) is an explanatory variable, \(Y\) is an outcome variable
    • But it doesn’t need to be
  • Conditional probability: Probability of \(Y\) at a given value of \(X\)
    • When \(X = 1\), what is the distribution of \(Y\)?
    • When \(X = 2\), what is the distribution of \(Y\)?
  • Conditional probability is the \(\color{red}{joint~value}\) divided by the \(\color{blue}{marginal~value}\) for that value of \(X\)
    • It is conditional on that value of \(X\)

3.10 Conditional probability

                   \(Y = 1\): No smoke             \(Y = 2\): Smoke               Total
\(X = 1\): Male    \(n_{11} = \color{red}{105}\)   \(n_{12} = \color{red}{17}\)   \(n_{1+} = \color{blue}{122}\)
\(X = 2\): Female  \(n_{21} = \color{red}{101}\)   \(n_{22} = \color{red}{9}\)    \(n_{2+} = \color{blue}{110}\)
Total              \(n_{+1} = \color{blue}{206}\)  \(n_{+2} = \color{blue}{26}\)  \(n = \color{blue}{232}\)

3.11 Conditional probability

                   \(Y = 1\): No smoke             \(Y = 2\): Smoke               Total
\(X = 1\): Male    \(n_{11} = \color{red}{105}\)   \(n_{12} = \color{red}{17}\)   \(n_{1+} = \color{blue}{122}\)
\(X = 2\): Female  \(n_{21} = \color{red}{101}\)   \(n_{22} = \color{red}{9}\)    \(n_{2+} = \color{blue}{110}\)
Total              \(n_{+1} = \color{blue}{206}\)  \(n_{+2} = \color{blue}{26}\)  \(n = \color{blue}{232}\)
  • When \(X = 1\) (male):
    • \(P(no\ smoke) = \frac{\color{red}{105}}{\color{blue}{122}} = 0.861\)
    • \(P(smoke) = \frac{\color{red}{17}}{\color{blue}{122}} = 0.139\)
  • When \(X = 2\) (female):
    • \(P(no\ smoke) = \frac{\color{red}{101}}{\color{blue}{110}} = 0.918\)
    • \(P(smoke) = \frac{\color{red}{9}}{\color{blue}{110}} = 0.082\)
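
The "joint divided by marginal" rule gives the same numbers as dividing each count by its row total. The sketch below checks this in base R with the cell counts entered by hand; the object names are illustrative.

```r
# Cell counts from the Sex-by-Smoke table (matrix() fills column by column)
counts <- matrix(c(105, 101, 17, 9), nrow = 2,
                 dimnames = list(Sex = c("Male", "Female"),
                                 Smoke = c("Non-smoker", "Smoker")))
joint <- counts / sum(counts)

# Conditional on Sex: each row's joint probabilities divided by that row's marginal
cond_on_sex <- joint / rowSums(joint)

# Same result straight from the counts: n_ij / n_i+
cond_from_counts <- counts / rowSums(counts)

round(cond_on_sex, 3)
```

Each row of cond_on_sex sums to 1 because the probabilities are conditional on that value of Sex.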

3.12 Sex and Smoke: Frequencies

smoke_sex
        
         Non-smoker Smoker
  Male          105     17
  Female        101      9

3.13 Sex and Smoke: And margins

smoke_sex_margins
        
         Non-smoker Smoker Sum
  Male          105     17 122
  Female        101      9 110
  Sum           206     26 232

3.14 Sex and Smoke: Marginal prob

smoke_sex_prop_margins
        
         Non-smoker     Smoker        Sum
  Male   0.45258621 0.07327586 0.52586207
  Female 0.43534483 0.03879310 0.47413793
  Sum    0.88793103 0.11206897 1.00000000

3.15 Sex and Smoke: Conditional prob

prop.table(smoke_sex, margin = 1)
        
         Non-smoker     Smoker
  Male   0.86065574 0.13934426
  Female 0.91818182 0.08181818
  • Conditional on Sex

4 Study design

4.1 Fixed vs random

  • The marginal frequencies of a contingency table can be either fixed or random
    • Fixed: Chosen by the researcher
      • e.g., Collect data on 50 men and 50 women
    • Random: Vary depending on the sample
      • e.g., Recruit a sample and record each person’s gender; the numbers of men and women are not set in advance

4.2 Why do we care?

  • Probability = random divided by fixed
    • If there are no fixed marginals, we can’t calculate a probability
  • Ratio = random divided by random
    • We can always calculate ratios (e.g., odds ratios)
  • What is “fixed” is what you can “condition on”
    • What are probabilities “out of”?

4.3 Why do we care?

  • Design of the contingency table determines
    • What is conditioned on
    • What are probabilities (and what are just ratios)
    • How you can talk about the relationship in the table
    • What statistical tests you can perform

4.4 Types of study designs

  • Three study designs with different fixed and random marginals
    • Cross-sectional (or multinomial)
    • Retrospective
    • Prospective (or product binomial)

4.5 Overall study design

         Heart attack  No heart attack
Placebo
Aspirin
  • Relationship between aspirin use (vs placebo) and heart attack
    • \(X\): Aspirin vs placebo
    • \(Y\): Heart attack vs no heart attack

4.6 Cross-sectional

         Heart attack  No heart attack  Total
Placebo                                 Random
Aspirin                                 Random
Total    Random        Random           Fixed
  • Collect data from \(n\) people
    • Measure aspirin vs placebo, heart attack vs not

4.7 Retrospective

         Heart attack  No heart attack  Total
Placebo                                 Random
Aspirin                                 Random
Total    Fixed         Fixed
  • Collect data from specific numbers of heart attack patients and non-heart attack patients
    • Measure whether they took aspirin

4.8 Prospective

         Heart attack  No heart attack  Total
Placebo                                 Fixed
Aspirin                                 Fixed
Total    Random        Random
  • Collect data from a specific number of people taking aspirin and a specific number taking placebo
    • Measure whether they have a heart attack
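
One way to see the difference between fixed and random margins is to simulate a cross-sectional (multinomial) table and a prospective (product binomial) table. The sketch below uses base R; all probabilities and sample sizes are hypothetical values chosen only for illustration, not estimates from any study.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical joint cell probabilities (they sum to 1)
p_joint <- c(placebo_ha = 0.05, placebo_no = 0.45,
             aspirin_ha = 0.03, aspirin_no = 0.47)

# Cross-sectional (multinomial): only the total n is fixed
n_total <- 200
cross <- matrix(rmultinom(1, size = n_total, prob = p_joint),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Placebo", "Aspirin"),
                                c("Heart attack", "No heart attack")))

# Prospective (product binomial): row totals are fixed, one binomial draw per row
n_row <- c(Placebo = 100, Aspirin = 100)
p_ha <- c(Placebo = 0.10, Aspirin = 0.06)   # hypothetical P(heart attack | group)
ha <- rbinom(2, size = n_row, prob = p_ha)
names(ha) <- names(n_row)
prosp <- cbind("Heart attack" = ha, "No heart attack" = n_row - ha)

rowSums(cross)   # random: changes from sample to sample
rowSums(prosp)   # fixed by design: always 100 and 100
```

A retrospective design would be simulated the same way as the prospective one, but with the column totals fixed and one binomial draw per column.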

4.9 Smoke and Sex example

  • What type of study design was this?
    • Multinomial: Total \(n\) is fixed
    • Retrospective: Smoke is fixed
    • Prospective: Sex is fixed
smoke_sex_margins
        
         Non-smoker Smoker Sum
  Male          105     17 122
  Female        101      9 110
  Sum           206     26 232

4.10 Measures of relationship

  • Any design
    • Odds ratio
    • Test of independence
  • Prospective design only (in addition to the above)
    • Difference in proportion
    • Relative risk

4.11 Difference in proportion

  • If we assume it is a prospective design re: Sex
  • Male
    • \(P(smoke) = \frac{\color{red}{17}}{\color{blue}{122}} = 0.139\)
  • Female
    • \(P(smoke) = \frac{\color{red}{9}}{\color{blue}{110}} = 0.082\)
  • Difference in proportion = \(0.139 - 0.082 = 0.058\) (computed from the unrounded proportions)
    • A man is 5.8 percentage points more likely to be a smoker than a woman

4.12 Relative risk

  • If we assume it is a prospective design re: Sex
  • Male
    • \(P(smoke) = \frac{\color{red}{17}}{\color{blue}{122}} = 0.139\)
  • Female
    • \(P(smoke) = \frac{\color{red}{9}}{\color{blue}{110}} = 0.082\)
  • Relative risk = \(\frac{0.139}{0.082} = 1.703\) (computed from the unrounded proportions)
    • A man is 1.703 times as likely to be a smoker as a woman
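
The difference in proportion and the relative risk can both be computed directly from the cell counts. This is a minimal base R sketch with the counts entered by hand and, as in the slides, Sex treated as the fixed margin; the object names are illustrative.

```r
# Cell counts from the Sex-by-Smoke table (matrix() fills column by column)
counts <- matrix(c(105, 101, 17, 9), nrow = 2,
                 dimnames = list(Sex = c("Male", "Female"),
                                 Smoke = c("Non-smoker", "Smoker")))

# P(smoke | sex): smokers divided by the row total for each sex
p_smoke <- counts[, "Smoker"] / rowSums(counts)

# Difference in proportion and relative risk, male vs female
diff_prop <- unname(p_smoke["Male"] - p_smoke["Female"])
rel_risk <- unname(p_smoke["Male"] / p_smoke["Female"])

round(c(diff = diff_prop, RR = rel_risk), 3)
```

Working from unrounded proportions matters here: dividing the rounded values 0.139 and 0.082 gives about 1.695 rather than 1.703.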

4.13 Odds ratio

\[odds(smoke) = P(smoke)/P(no\ smoke)\]

  • Male
    • \(P(smoke) = \frac{\color{red}{17}}{\color{blue}{122}} = 0.139\)
    • \(odds = \frac{0.139}{0.861} = 0.162\)
  • Female
    • \(P(smoke) = \frac{\color{red}{9}}{\color{blue}{110}} = 0.082\)
    • \(odds = \frac{0.082}{0.918} = 0.089\)
  • \(odds\ ratio = \frac{0.089}{0.162} = 0.55\)
    • The odds of a woman smoking are 0.55 times the odds of a man smoking
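
The odds and the odds ratio can be sketched in base R from the cell counts. Within each sex, \(P(smoke)/P(no\ smoke)\) reduces to the number of smokers divided by the number of non-smokers, and the odds ratio reduces to a cross-product of the four cells. The object names are illustrative.

```r
# Cell counts from the Sex-by-Smoke table (matrix() fills column by column)
counts <- matrix(c(105, 101, 17, 9), nrow = 2,
                 dimnames = list(Sex = c("Male", "Female"),
                                 Smoke = c("Non-smoker", "Smoker")))

# Odds of smoking within each sex: smokers / non-smokers
odds_smoke <- counts[, "Smoker"] / counts[, "Non-smoker"]

# Odds ratio, female vs male
or_female_vs_male <- unname(odds_smoke["Female"] / odds_smoke["Male"])

# Equivalent cross-product form: (n22 * n11) / (n21 * n12)
or_cross <- (counts["Female", "Smoker"] * counts["Male", "Non-smoker"]) /
  (counts["Female", "Non-smoker"] * counts["Male", "Smoker"])

round(c(odds_male = unname(odds_smoke["Male"]),
        odds_female = unname(odds_smoke["Female"]),
        OR = or_female_vs_male), 3)
```

Taking the reciprocal (about 1.82) gives the odds ratio for males vs females; the cross-product form uses only the cell counts, not the margins.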

4.14 Independence

  • Do the observed frequencies match what we’d expect if the variables were unrelated or independent?
  • Observed
smoke_sex
        
         Non-smoker Smoker
  Male          105     17
  Female        101      9
  • Expected (independence)
library(epitools)
expected(smoke_sex)
        
         Non-smoker   Smoker
  Male    108.32759 13.67241
  Female   97.67241 12.32759
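
Under independence, each expected count is the row total times the column total divided by \(n\). The following base R sketch reproduces the expected table from the cell counts without the epitools package; the object names are illustrative.

```r
# Cell counts from the Sex-by-Smoke table (matrix() fills column by column)
counts <- matrix(c(105, 101, 17, 9), nrow = 2,
                 dimnames = list(Sex = c("Male", "Female"),
                                 Smoke = c("Non-smoker", "Smoker")))
n <- sum(counts)

# Expected count in cell (i, j) under independence: n_i+ * n_+j / n
expected_counts <- outer(rowSums(counts), colSums(counts)) / n

round(expected_counts, 5)
```

The expected table keeps the same row and column totals as the observed table; only the cells inside are redistributed.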

5 In-class activities

5.1 In-class activities

  • Conduct two independent samples tests
    • \(z\)-test, \(t\)-test, Welch’s \(t\)-test
  • Read contingency tables and get oriented to their layout
  • Think about study design

5.2 Next week

  • Relate study design to tests for contingency tables
    • Some tests are only available for some designs
  • Related-samples tests