Tukey advocated for and developed EDA as an adjunct to traditional statistical analyses (not a replacement)
Summarize variables in a data set
Often use visual methods
May or may not involve a statistical model
Used to formulate hypotheses that could lead to new data collection and experiments
Includes preliminary data examination prior to analysis
It doesn’t really
We do (or should do) exploratory data analysis in the service of our analyses all the time
Examining variables for normality
Transforming variables to be more normal
Examining spaghetti plots before conducting mixed models
Exploring non-linear (e.g., quadratic) relationships
There is a chapter on EDA in the “R for Data Science” book by Wickham and Grolemund: https://r4ds.had.co.nz/exploratory-data-analysis.html
The objectives of EDA are to:
Suggest hypotheses about the causes of observed phenomena
Assess assumptions on which statistical inference will be based
Support the selection of appropriate statistical tools and techniques
Provide a basis for further data collection through surveys or experiments
From https://en.wikipedia.org/wiki/Exploratory_data_analysis, attributed to Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2(2), 131 - 160.
Exploratory analysis is ok
What’s not ok is to do confirmatory data analysis (CDA) on the same data you did EDA on
With a few exceptions, like assessing assumptions
Don’t test a hypothesis on the same data you derived it from
Options for EDA followed by CDA:
Use two different samples
Split your sample in half: EDA on one half, followed by CDA on other half
Other splits for cross-validation: k-fold, Monte Carlo subsample, leave 1 out, leave p out
Obviously, you need a fairly large sample to be able to split it
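The splitting options above can be sketched in base R. This is a minimal illustration, not a full cross-validation workflow: the data frame dat, its columns, the seed, and the choice of k = 5 are all hypothetical.

```r
# Hypothetical example data: 200 rows with an id and an outcome
set.seed(2019)  # fix the random seed so the split is reproducible
dat <- data.frame(id = 1:200, y = rnorm(200))

# Split-half: randomly pick half of the rows for EDA, keep the rest for CDA
eda_rows <- sample(nrow(dat), size = floor(nrow(dat) / 2))
eda_half <- dat[eda_rows, ]
cda_half <- dat[-eda_rows, ]

# k-fold: randomly assign each row to one of k roughly equal folds
k <- 5
dat$fold <- sample(rep(1:k, length.out = nrow(dat)))
table(dat$fold)
```

Explore eda_half freely, then run the planned confirmatory tests only on cda_half.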
We already have tools to do EDA
There are several packages that aim to assist in EDA, or even semi-automate it
skim function (skimr package)
Use on data frames, grouped data frames, vectors, matrices
Returns descriptive statistics on all variables, grouped by variable type
Output is all text
Can “pipe” the output to other things (e.g., pipe means to a plot)
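A short sketch of skim on a plain and a grouped data frame, assuming the skimr and dplyr packages are installed. The built-in iris data set is used only for illustration and is not part of the original slides.

```r
library(dplyr)
library(skimr)

# Descriptive statistics for every variable, grouped by variable type
skimmed <- skim(iris)
skimmed

# The same summary computed separately for each species
iris %>%
  group_by(Species) %>%
  skim()
```

The result is a data frame with one row per variable, which is what makes it possible to pipe it into further steps.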
vis_dat function (visdat package)
vis_miss function (visdat package)
vis_cor function (visdat package)
plot_missing function (DataExplorer package)
plot_bar function (DataExplorer package)
plot_histogram function (DataExplorer package)
plot_qq, plot_correlation, plot_boxplot, plot_scatterplot functions (DataExplorer package)
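inspect_types (inspectdf package)
inspect_na (inspectdf package)
inspect_cor (inspectdf package)
inspect_imb (inspectdf package)
inspect_num (inspectdf package)
inspect_cat (inspectdf package)
inspect_mem (inspectdf package)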
Document templates that change appearance for you automatically
papaja: https://github.com/crsh/papaja
tufte: https://rstudio.github.io/tufte/
rticles: https://bookdown.org/yihui/rmarkdown/journals.html
Many others: http://jianghao.wang/post/2017-12-08-rmarkdown-templates/
# Install devtools package if necessary
if(!"devtools" %in% rownames(installed.packages())) install.packages("devtools")
# Install the stable development version from GitHub
devtools::install_github("crsh/papaja")
LaTeX is a document preparation and typesetting system
https://www.latex-project.org/
Similar to Markdown, in that you code your text as you write it, then compile to get an output document
Markdown is not the same as LaTeX, but you can use some LaTeX commands to customize your markdown documents
Inline equations surrounded by dollar signs
(Subscript with underscore, superscript with caret)
Code: For this model, the $\it R_{mult}^2 = 0.42$
Output: For this model, the \(\it R_{mult}^2 = 0.42\)
Centered, non-inline equations between \[ and \]
(Math-specific characters like hat)
Code: \[\hat Y = b_0 + b_1 X_1 + b_2 X_2 \]
Output: \[\hat Y = b_0 + b_1 X_1 + b_2 X_2 \]
Greek letters surrounded by dollar signs
Code:
The lambda parameter, $\lambda$, represents the mean of the distribution.
Output: The lambda parameter, \(\lambda\), represents the mean of the distribution.
General formatting commands
\newpage will force a page break in a paged document
fancyhdr is a LaTeX package for customizing page headers and footers
http://mirror.las.iastate.edu/tex-archive/macros/latex/contrib/fancyhdr/fancyhdr.pdf
From the syllabus YAML:
header-includes:
- \usepackage{fancyhdr}
- \pagestyle{fancy}
- \fancyhead[RO,RE]{Statistical Graphics}
- \fancyhead[LO,LE]{S. Coxe}
- \fancyfoot[LE,RO]{Fall 2019}
- \fancypagestyle{plain}{\pagestyle{fancy}}
\usepackage{} is the LaTeX equivalent to library()
E: Even page
O: Odd page
L: Left field
C: Center field
R: Right field
H: Header
F: Footer
Include an image in the upper right corner:
\fancyhead[RO,RE]{\includegraphics[width=3cm]{picture.jpg}}
Stylesheets (.sty) for Beamer presentations
From the presentation stylesheet YAML:
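output:
  beamer_presentation:     # indent 1 level (2 spaces)
    theme: CoxeDiv5        # indent 2 levels (4 spaces)
Indentation matters for YAML, and it must be made of spaces (YAML does not allow tab characters for indentation)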
The .sty file is called beamerthemeCoxeDiv5.sty
The theme: line is the part after “beamertheme”
I usually don’t create a .sty file from scratch
\definecolor{fiublack}{HTML}{181818}
\definecolor{fiuyellow}{HTML}{B6862C}
\definecolor{fiublue}{HTML}{081E3F}
\setbeamercolor*{Title bar}{fg=fiublue}
Line 34 - 35:
\titlegraphic{\includegraphics
  [width=0.15\textwidth,%height=.5\textheight
  ]{fiulogo_square}}
\setbeamertemplate{footline}
{
\linethickness{0.25pt}
\framelatex{
\begin{beamercolorbox}[leftskip=.3cm,sep=0.1cm]{Location bar}
\usebeamerfont{section in head/foot}
\insertshortauthor~|~\insertshorttitle
\hfill
\insertframenumber/\inserttotalframenumber
\end{beamercolorbox}}
}
Requires some data manipulation
geom_errorbar requires minimum and maximum values for the error bars
These are typically based on the standard error (of a statistic or parameter) or the standard deviation (of raw data)
Use summarize (or a similar function) to get standard deviations / errors, then supply minimum and maximum values
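For error bars based on the standard error of the mean rather than the standard deviation, one possible sketch is below. It assumes the dplyr and gapminder packages are installed. The names gapminder_se, life_exp_se, lower, and upper are hypothetical, and the standard error is computed as the SD divided by the square root of the group size.

```r
library(dplyr)
library(gapminder)

gapminder_se <- gapminder %>%
  filter(year == 2002) %>%
  group_by(continent) %>%
  summarize(life_exp_mean = mean(lifeExp),
            # standard error of the mean: SD / sqrt(number of countries)
            life_exp_se = sd(lifeExp) / sqrt(n())) %>%
  ungroup() %>%
  # minimum and maximum values to hand to geom_errorbar as ymin / ymax
  mutate(lower = life_exp_mean - life_exp_se,
         upper = life_exp_mean + life_exp_se)
```

The lower and upper columns can then be mapped directly to ymin and ymax.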
library(tidyverse)   # dplyr verbs, glimpse(), and ggplot2
library(gapminder)   # gapminder data set
#glimpse(gapminder)
gapminder_summ <- gapminder %>%  
  filter(year == 2002) %>%
  group_by(continent) %>%  
  summarize(life_exp_mean = mean(lifeExp), 
              life_exp_sd = sd(lifeExp)) %>%
  ungroup()
glimpse(gapminder_summ)
## Rows: 5
## Columns: 3
## $ continent     <fct> Africa, Americas, Asia, Europe, Oceania
## $ life_exp_mean <dbl> 53.32523, 72.42204, 69.23388, 76.70060, 79.74000
## $ life_exp_sd   <dbl> 9.5864959, 4.7997055, 8.3745954, 2.9221796, 0.8909545
ggplot can calculate the actual ymin and ymax values for you
scatter_error <- 
  ggplot(data = gapminder_summ, 
    aes(x = continent, y = life_exp_mean)) +
  geom_point() +
  geom_errorbar(data = gapminder_summ, 
                  aes(x = continent, 
                  ymin = life_exp_mean - life_exp_sd, 
                  ymax = life_exp_mean + life_exp_sd))
bar_both <- 
  ggplot(data = gapminder_summ,
    aes(x = continent, y = life_exp_mean)) +
  geom_col(fill = "blue", alpha = 0.2) +
  geom_errorbar(data = gapminder_summ, 
                  aes(x = continent, 
                  ymin = life_exp_mean - life_exp_sd, 
                  ymax = life_exp_mean + life_exp_sd))
Just shows the upper limit
ymax = upper limit
ymin = point estimate
Color option makes outline black so lower limit isn’t as obvious
bar_upper <- 
  ggplot(data = gapminder_summ, 
         aes(x = continent, y = life_exp_mean)) +
  geom_col(color = "black", fill = "blue", alpha = 0.2) +
  geom_errorbar(data = gapminder_summ, 
                  aes(x = continent, 
                  ymin = life_exp_mean, 
                  ymax = life_exp_mean + life_exp_sd))
Cross bar at point estimate
Box to ymin and ymax
crossbar <- 
  ggplot(data = gapminder_summ, 
         aes(x = continent, y = life_exp_mean)) +
  geom_point() +
  geom_crossbar(data = gapminder_summ, 
                  aes(x = continent, 
                  ymin = life_exp_mean - life_exp_sd, 
                  ymax = life_exp_mean + life_exp_sd))
No caps on the ends of the error bars
linerange <- 
  ggplot(data = gapminder_summ, 
         aes(x = continent, y = life_exp_mean)) +
  geom_point() +
  geom_linerange(data = gapminder_summ, 
                  aes(x = continent, 
                  ymin = life_exp_mean - life_exp_sd, 
                  ymax = life_exp_mean + life_exp_sd))
No geom_point needed here
Add the point estimate in the aesthetic for the geom_pointrange function
pointrange <- 
  ggplot(data = gapminder_summ, 
         aes(x = continent, y = life_exp_mean)) +
  geom_pointrange(data = gapminder_summ, 
                  aes(x = continent, y = life_exp_mean,
                  ymin = life_exp_mean - life_exp_sd, 
                  ymax = life_exp_mean + life_exp_sd))
I used all summarized data (mean and SD) in these plots
You can instead plot unsummarized data
Then layer error bars on that
Each geom can have its own dataset
Use observed dataset in the geom_point / geom_bar call
Use summarized dataset in the error bar call
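One way the layering described above might look, assuming the dplyr, ggplot2, and gapminder packages are installed. The object names gapminder_2002 and raw_with_error, the jitter, and the bar width are illustrative choices, not taken from the original slides.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Observed (unsummarized) data: one row per country in 2002
gapminder_2002 <- gapminder %>%
  filter(year == 2002)

# Summarized data: one row per continent
gapminder_summ <- gapminder_2002 %>%
  group_by(continent) %>%
  summarize(life_exp_mean = mean(lifeExp),
            life_exp_sd = sd(lifeExp)) %>%
  ungroup()

raw_with_error <-
  ggplot(data = gapminder_2002,
         aes(x = continent, y = lifeExp)) +
  # raw points come from the observed dataset
  geom_point(alpha = 0.3,
             position = position_jitter(width = 0.1, height = 0)) +
  # error bars come from the summarized dataset; inherit.aes = FALSE keeps
  # this layer from looking for lifeExp, which the summary does not contain
  geom_errorbar(data = gapminder_summ,
                aes(x = continent,
                    ymin = life_exp_mean - life_exp_sd,
                    ymax = life_exp_mean + life_exp_sd),
                inherit.aes = FALSE,
                width = 0.3)
```

Setting inherit.aes = FALSE is the key design choice here: the plot-level y = lifeExp mapping applies only to the raw-data layer, while the error bar layer supplies its own complete set of aesthetics.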