BTS 510 Lab 11

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(12345)
theme_set(theme_classic(base_size = 16))

1 Learning objectives

  • Describe family-wise error rate (FWER) methods of multiple comparison correction
  • Describe false discovery rate (FDR) methods of multiple comparison correction
  • Compare and contrast the methods

2 Data

  • mtept dataset from multcomp package
    • A data frame with 111 observations on the following 5 variables.
      • treatment: a factor with levels Drug and Placebo
      • E1: endpoint 1
      • E2: endpoint 2
      • E3: endpoint 3
      • E4: endpoint 4
library(multcomp)
Loading required package: mvtnorm
Loading required package: survival
Loading required package: TH.data
Loading required package: MASS

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select

Attaching package: 'TH.data'
The following object is masked from 'package:MASS':

    geyser
data(mtept)
head(mtept)
   treatment E1 E2 E3 E4
1    Placebo  4  3  3  5
5    Placebo  5  0  1  7
9    Placebo  1  0  1  9
13   Placebo  4  0  3  5
17   Placebo  3  0  2  9
21   Placebo  4  1  2  6

3 Analyses

  • t-test for each endpoint (outcome variable)
    • Comparing treatment (Drug vs Placebo)
    • Save each analysis with a unique name: t_e1, t_e2, t_e3, t_e4
t_e1 <- t.test(mtept$E1 ~ mtept$treatment, 
               alternative = "two.sided",
               var.equal = FALSE)
t_e1

    Welch Two Sample t-test

data:  mtept$E1 by mtept$treatment
t = -2.5464, df = 106.8, p-value = 0.01231
alternative hypothesis: true difference in means between group Drug and group Placebo is not equal to 0
95 percent confidence interval:
 -1.2064907 -0.1502344
sample estimates:
   mean in group Drug mean in group Placebo 
             2.543860              3.222222 
t_e2 <- t.test(mtept$E2 ~ mtept$treatment, 
               alternative = "two.sided",
               var.equal = FALSE)
t_e2

    Welch Two Sample t-test

data:  mtept$E2 by mtept$treatment
t = -2.4363, df = 62.878, p-value = 0.01768
alternative hypothesis: true difference in means between group Drug and group Placebo is not equal to 0
95 percent confidence interval:
 -2.7569954 -0.2722443
sample estimates:
   mean in group Drug mean in group Placebo 
            0.9298246             2.4444444 
t_e3 <- t.test(mtept$E3 ~ mtept$treatment, 
               alternative = "two.sided",
               var.equal = FALSE)
t_e3

    Welch Two Sample t-test

data:  mtept$E3 by mtept$treatment
t = -1.2868, df = 102.85, p-value = 0.2011
alternative hypothesis: true difference in means between group Drug and group Placebo is not equal to 0
95 percent confidence interval:
 -0.9511305  0.2025925
sample estimates:
   mean in group Drug mean in group Placebo 
             2.403509              2.777778 
t_e4 <- t.test(mtept$E4 ~ mtept$treatment, 
               alternative = "two.sided",
               var.equal = FALSE)
t_e4

    Welch Two Sample t-test

data:  mtept$E4 by mtept$treatment
t = 2.3814, df = 108.91, p-value = 0.01898
alternative hypothesis: true difference in means between group Drug and group Placebo is not equal to 0
95 percent confidence interval:
 0.1258604 1.3751143
sample estimates:
   mean in group Drug mean in group Placebo 
             7.491228              6.740741 
Note

If you had a lot of tests to do, you wouldn’t do it this way. You’d write a function to loop through all the analyses and save the results. I don’t want us to get caught up in the technicalities of that, so we’re doing it this way.
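For reference, a minimal sketch of that looping approach (using lapply() over the endpoint names from the dataset above) might look like this:

# Run the same Welch t-test for each endpoint and collect the results
endpoints <- c("E1", "E2", "E3", "E4")
t_tests <- lapply(endpoints, function(e) {
  t.test(mtept[[e]] ~ mtept$treatment,
         alternative = "two.sided",
         var.equal = FALSE)
})
names(t_tests) <- endpoints
# Extract all four p-values at once
sapply(t_tests, function(x) x$p.value)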

3.1 Structure of output

  • Let’s look at the “structure” of the output using the str() function
    • This lists all of the pieces in the output (including some that are not typically shown)
    • If you know the name of an object, you can do things with that object
str(t_e1)
List of 10
 $ statistic  : Named num -2.55
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 107
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.0123
 $ conf.int   : num [1:2] -1.21 -0.15
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] 2.54 3.22
  ..- attr(*, "names")= chr [1:2] "mean in group Drug" "mean in group Placebo"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means between group Drug and group Placebo"
 $ stderr     : num 0.266
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "mtept$E1 by mtept$treatment"
 - attr(*, "class")= chr "htest"
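For instance, since str() shows that t_e1 contains elements named statistic and conf.int, you can pull those out directly by name:

t_e1$statistic   # the t statistic
t_e1$conf.int    # the 95% confidence interval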

3.2 Save p-values

  • The p-value for the test is called p.value
    • You can refer to the p-value for the first analysis as t_e1$p.value
    • Save the p-values in a vector
pvalues <- c(t_e1$p.value, t_e2$p.value, t_e3$p.value, t_e4$p.value)
pvalues
[1] 0.01231127 0.01768143 0.20106357 0.01898487

4 Multiple comparison methods

  • p.adjust() function in stats package
    • p: vector of p-values for the tests
    • method: Method of adjustment
      • "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
    • n: number of tests (defaults to the number of p-values supplied)

4.1 Bonferroni correction

  • Adjusted p-values are the observed p-values multiplied by the number of tests (capped at 1)
    • Here, each observed p-value is multiplied by 4
  • Compare to nominal α (typically .05)
p.adjust(pvalues, method = "bonferroni", n = 4)
[1] 0.04924507 0.07072572 0.80425429 0.07593949
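Equivalently, you could compute the Bonferroni adjustment by hand; p.adjust() just does this (and the capping at 1) for you:

# Manual Bonferroni: each p-value times the number of tests, capped at 1
pmin(pvalues * 4, 1)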

4.2 Holm correction

  • Sort observed p-values (smallest to largest)
  • Adjusted p-values are the observed p-values multiplied by the number of remaining tests
    • Here, the smallest p-value is multiplied by 4, the next smallest by 3, the next by 2, and the largest by 1
    • A running maximum then keeps the adjusted values in rank order, which is why two of the values below are identical
  • Compare to nominal α (typically .05)
p.adjust(pvalues, method = "holm", n = 4)
[1] 0.04924507 0.05304429 0.20106357 0.05304429
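You can reproduce the Holm adjustment by hand, including the running maximum that enforces the ordering:

# Manual Holm: sort, multiply by the number of remaining tests,
# enforce monotonicity with a running maximum, then restore the original order
o <- order(pvalues)
holm <- pmin(cummax((4:1) * pvalues[o]), 1)
holm[order(o)]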

4.3 False discovery rate (FDR)

  • Sort observed p-values (smallest to largest)
  • Adjusted p-values are the observed p-values multiplied by (total number of tests) / (rank of the test)
    • Here, the smallest p-value is multiplied by 4/1, the next smallest by 4/2, the next by 4/3, and the largest by 4/4
    • A running minimum (working down from the largest p-value) keeps the adjusted values in rank order, which is why three of the values below are identical
  • Compare to selected FDR (can be the same as α but is typically higher)
p.adjust(pvalues, method = "BH", n = 4)
[1] 0.02531316 0.02531316 0.20106357 0.02531316
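The same by-hand logic works for the BH adjustment, with the running minimum applied from the largest p-value down:

# Manual BH: sort, multiply by (number of tests / rank),
# enforce monotonicity with a running minimum, then restore the original order
o <- order(pvalues)
bh <- pmin(rev(cummin(rev((4 / (1:4)) * pvalues[o]))), 1)
bh[order(o)]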

5 Compare

  • How many of the tests were significant for each method?
    • Which ones were selected for each?
  • Does one method seem much better or different from the others?
  • Which would you select to use? Why?
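One way to approach these questions is to line the unadjusted and adjusted p-values up side by side, for example:

# Compare all three corrections against the unadjusted p-values
data.frame(endpoint   = c("E1", "E2", "E3", "E4"),
           unadjusted = pvalues,
           bonferroni = p.adjust(pvalues, method = "bonferroni"),
           holm       = p.adjust(pvalues, method = "holm"),
           bh         = p.adjust(pvalues, method = "BH"))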

6 Another example

  • Data from The Hitchhiker’s Guide to Longitudinal Models
    • Download the data here
  • executive-function dataset (Excel / CSV file on course website, loaded here)
    • N = 342 adolescents
    • id: ID variable
    • sex: Self-identified sex
    • tx: Treatment
    • dlpfc1 to dlpfc4: Dorsolateral prefrontal cortex activation during executive functioning task across 4 years
    • ef1 to ef4: Scores on executive functioning task across 4 years
    • age1 to age4: Age at each wave
  • Simulated data made to resemble real data
exfxn <- read.csv("executive-function.csv")
head(exfxn)
  id sex tx     dlpfc1    dlpfc2     dlpfc3    dlpfc4       ef1      ef2
1  1   1  0 -0.1839186 1.1288984 -0.8403271 0.4724899 2.1666665 1.805555
2  2   1  0  0.8006942 1.1288984  0.8006942 1.4571027 0.7222222 1.444444
3  3   1  1  0.4724899 1.1288984  0.1442857 0.1442857 3.2499998 3.250000
4  4   0  0  0.4724899 0.4724899  0.4724899 0.4724899 3.6111108 4.333333
5  5   1  0 -0.8403271 2.1135112  2.4417154 2.4417154 2.1666665 1.805555
6  6   1  0  2.1135112 1.4571027  1.4571027 0.1442857 2.1666665 2.166666
       ef3      ef4     age1     age2     age3     age4
1 1.444444 2.888889 12.02653 13.05762 14.07388 15.11197
2 1.805555 2.527778 12.08908 13.12362 13.99743 15.02093
3 2.527778 2.527778 11.95321 13.04755 13.82033 15.05837
4 3.972222 3.972222 12.07584 12.84529 13.81844 14.93139
5 2.888889 3.250000 11.93583 13.00555 13.97423 14.89815
6 2.166666 2.527778 12.06277 13.06801 13.99393 14.88652
  • The structure of this dataset gives you an idea of
    • How many comparisons you might make
    • How those are tied to hypotheses
  • Variables include
    • 4 repeated measures of DLPFC activation
    • 4 repeated measures of executive functioning
    • Two different grouping variables
  • Some hypotheses (that I just arbitrarily came up with based on the variables)
    • Are there treatment differences in DLPFC activation?
    • Are there treatment differences in executive functioning?
    • Are there treatment differences in DLPFC activation or executive functioning?
    • Are there sex differences in DLPFC activation?
    • Are there sex differences in executive functioning?
    • Are there sex differences in DLPFC activation or executive functioning?
  • How many tests would you do for each hypothesis?
    • Is it different for different hypotheses?
    • Remember that you only need to control for tests within a hypothesis, not all tests (see the sketch below)
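As one possible sketch of controlling within a hypothesis: the first hypothesis (treatment differences in DLPFC activation) could be operationalized as four t-tests, one per wave, corrected only within that family of four (variable names are from the exfxn dataset loaded above):

# One possible family: treatment differences in DLPFC activation at each wave
dlpfc_vars <- c("dlpfc1", "dlpfc2", "dlpfc3", "dlpfc4")
dlpfc_p <- sapply(dlpfc_vars, function(v) {
  t.test(exfxn[[v]] ~ exfxn$tx, var.equal = FALSE)$p.value
})
# Correct only within this hypothesis's family of 4 tests
p.adjust(dlpfc_p, method = "holm")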