Exploratory data analysis

What is exploratory data analysis?

Tukey advocated for and developed EDA as an adjunct to traditional statistical analyses (not a replacement)

Summarize variables in a data set

Often use visual methods

May or may not involve a statistical model

Used to formulate hypotheses that could lead to new data collection and experiments

  • We want to be surprised

Includes preliminary data examination prior to analysis

  • Look at distributions, nonlinear relationships, etc.

How does EDA differ from what we normally do?

It doesn’t really

We do (or should do) exploratory data analysis in the service of our analyses all the time

  • Examining variables for normality

  • Transforming variables to be more normal

  • Examining spaghetti plots before conducting mixed models

  • Exploring non-linear (e.g., quadratic) relationships
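A minimal sketch of the first two checks, using base R only. The `income` variable is simulated for illustration and the `skewness()` helper is hypothetical; neither comes from the course data.

```r
# Hypothetical right-skewed variable (simulated; not from the lecture data)
set.seed(123)
income <- rlnorm(500, meanlog = 10, sdlog = 1)

# Simple moment-based skewness (close to 0 for a symmetric distribution)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(income)
skew_log <- skewness(log(income))

# Visual checks: histogram and normal Q-Q plot, before and after a log transform
op <- par(mfrow = c(2, 2))
hist(income, main = "Raw")
qqnorm(income, main = "Raw: normal Q-Q"); qqline(income)
hist(log(income), main = "Log-transformed")
qqnorm(log(income), main = "Log: normal Q-Q"); qqline(log(income))
par(op)

c(raw = skew_raw, log = skew_log)
```

The log transform is only one option; which transformation (if any) is appropriate depends on the variable.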

There is a chapter on EDA in the “R for Data Science” book by Wickham and Grolemund: https://r4ds.had.co.nz/exploratory-data-analysis.html

Objectives of EDA

The objectives of EDA are to:

  • Suggest hypotheses about the causes of observed phenomena

  • Assess assumptions on which statistical inference will be based

  • Support the selection of appropriate statistical tools and techniques

  • Provide a basis for further data collection through surveys or experiments

From https://en.wikipedia.org/wiki/Exploratory_data_analysis, attributed to Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2(2), 131 - 160.

Exploratory versus confirmatory

Exploratory analysis is ok

  • What’s not ok is to do confirmatory data analysis (CDA) on the same data you did EDA on

  • With a few exceptions, like assessing assumptions

  • Don’t test a hypothesis on the same data you derived it from

Options for EDA followed by CDA:

  • Use two different samples

  • Split your sample in half: EDA on one half, followed by CDA on other half

  • Other splits for cross-validation: k-fold, Monte Carlo subsample, leave 1 out, leave p out

Obviously, you need a fairly large sample to be able to split it
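A rough sketch of the split-half and k-fold options in base R. The built-in `mtcars` data set is only a stand-in (it is far too small for a real split), and the seed and number of folds are arbitrary choices.

```r
set.seed(2019)        # make the random split reproducible
n <- nrow(mtcars)     # mtcars is only a stand-in data set

# Split in half: EDA on one half, CDA on the other half
eda_rows <- sample(n, size = floor(n / 2))
eda_half <- mtcars[eda_rows, ]
cda_half <- mtcars[-eda_rows, ]

# k-fold: randomly assign each row to one of k folds of near-equal size
k <- 4
fold <- sample(rep(seq_len(k), length.out = n))
table(fold)
```

The key point is that no row appears in both halves, so hypotheses generated from `eda_half` are tested on data that did not suggest them.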

Exploration tools

We already have tools to do EDA

  • Plots, glimpse / head / etc., summary statistics

There are several packages that aim to assist in EDA, or even semi-automate it

skimr

skim function

  • Use on data frames, grouped data frames, vectors, matrices

  • Returns descriptive statistics on all variables, grouped by variable type

  • Output is all text

  • Can “pipe” the output to other things (e.g., pipe means to a plot)
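As an illustration, here is one way these uses of `skim` might look. The built-in `iris` data set is an assumption for the example, and the column names `skim_type`, `skim_variable`, and `numeric.mean` are those produced by skimr version 2.

```r
library(skimr)
library(dplyr)
library(ggplot2)

# Descriptive statistics for every variable, grouped by variable type
skim(iris)

# Works on grouped data frames too
iris %>%
  group_by(Species) %>%
  skim()

# The result is a data frame, so it can be piped onward:
# here, the mean of each numeric variable goes into a bar plot
skim_means <- skim(iris) %>%
  filter(skim_type == "numeric") %>%
  select(skim_variable, numeric.mean)

ggplot(skim_means, aes(x = skim_variable, y = numeric.mean)) +
  geom_col()
```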

visdat

vis_dat function

  • Use on a data frame or variable. Returns a plot of value types (including missing)

vis_miss function

  • Use on a data frame or variable. Returns a plot of missing versus present

vis_cor function

  • Use on a data frame. Returns a graphical correlation matrix
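A short sketch of the three visdat functions, assuming the built-in `airquality` data set (chosen here because it is all numeric and has some missing values).

```r
library(visdat)

# airquality has missing values in Ozone and Solar.R
p_types <- vis_dat(airquality)   # value types, with missing cells marked
p_miss  <- vis_miss(airquality)  # missing versus present
p_cor   <- vis_cor(airquality)   # correlation matrix as a heatmap

p_types
p_miss
p_cor
```

Each call returns a ggplot object, so the plots can be modified with the usual ggplot2 layers and themes.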

DataExplorer

plot_missing function

  • Graphical percent missing for all variables in a data frame

plot_bar function

  • Values for categorical variables

plot_histogram function

  • Values for continuous variables

plot_qq, plot_correlation, plot_boxplot, plot_scatterplot functions

  • Do what it sounds like they do
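The DataExplorer functions above could be called along these lines. The built-in `airquality` and `iris` data sets and the `by` variables are assumptions made for the example.

```r
library(DataExplorer)

p_missing <- plot_missing(airquality)  # percent missing per variable
plot_bar(iris)                         # frequencies of categorical variables (Species)
plot_histogram(iris)                   # histograms of continuous variables
plot_qq(iris)                          # Q-Q plots of continuous variables
plot_correlation(iris)                 # correlation heatmap
plot_boxplot(iris, by = "Species")     # boxplots split by a grouping variable
plot_scatterplot(iris, by = "Sepal.Length")  # each variable against Sepal.Length
```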

inspectdf

inspect_types

  • Plot of variable types in a data frame

inspect_na

  • Plot of missing value percentages for each variable

inspect_cor

  • Cool plot of all correlations between pairs of variables

inspect_imb

  • Graph of most common response for categorical variables

inspectdf

inspect_num

  • Distributions of numeric variables

inspect_cat

  • Distributions of categorical variables

inspect_mem

  • Information on memory usage of different variables
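A sketch of the inspectdf workflow: each `inspect_` function returns a summary data frame, and `show_plot()` turns that summary into a plot. The `starwars` data set from dplyr is an assumption for the example.

```r
library(inspectdf)
library(dplyr)

# starwars (from dplyr) mixes numeric, character, and list columns, with some NAs
na_summary <- inspect_na(starwars)   # one row per variable: count and percent missing
na_summary
show_plot(na_summary)

inspect_types(starwars) %>% show_plot()  # variable types
inspect_cor(starwars)   %>% show_plot()  # pairwise correlations
inspect_imb(starwars)   %>% show_plot()  # most common level of each categorical variable
inspect_num(starwars)   %>% show_plot()  # distributions of numeric variables
inspect_cat(starwars)   %>% show_plot()  # distributions of categorical variables
inspect_mem(starwars)   %>% show_plot()  # memory usage by variable
```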

Customizing documents

Templates

Document templates that change appearance for you automatically

papaja: https://github.com/crsh/papaja

tufte: https://rstudio.github.io/tufte/

rticles: https://bookdown.org/yihui/rmarkdown/journals.html

Many others: http://jianghao.wang/post/2017-12-08-rmarkdown-templates/

  • Once you’ve installed the package, you can choose the template from File -> New File -> R Markdown -> From Template

papaja package has an APA (v6) article template

# Install devtools package if necessary

if(!"devtools" %in% rownames(installed.packages())) install.packages("devtools")

# Install the stable development version from GitHub

devtools::install_github("crsh/papaja")
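Once papaja is installed, a new document can also be started from the console instead of the RStudio menu. This is a sketch: the file name `my_manuscript.Rmd` is hypothetical, and `"apa6"` is the template name used in the papaja README.

```r
library(rmarkdown)

# Create a new R Markdown file from papaja's APA (v6) template
# "my_manuscript.Rmd" is a hypothetical file name
draft("my_manuscript.Rmd",
      template = "apa6",
      package = "papaja",
      create_dir = FALSE,  # put the file in the working directory
      edit = FALSE)        # don't open the file automatically
```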

Using commands from LaTeX

A document preparation and typesetting system

https://www.latex-project.org/

Similar to Markdown, in that you code your text as you write it, then compile to get an output document

Markdown is not the same as LaTeX, but you can use some LaTeX commands to customize your markdown documents

  • The reproducible presentation I showed you used a LaTeX stylesheet to change the appearance of the presentation (a LaTeX beamer presentation)

Using commands from LaTeX

Inline equations surrounded by dollar signs

(Subscript with underscore, superscript with caret)

Code: For this model, the $\it R_{mult}^2 = 0.42$

Output: For this model, the \(\it R_{mult}^2 = 0.42\)

Centered, non-inline equations between \[ and \]

(Math-specific characters like hat)

Code: \[\hat Y = b_0 + b_1 X_1 + b_2 X_2 \]

Output: \[\hat Y = b_0 + b_1 X_1 + b_2 X_2 \]

Using commands from LaTeX

Greek letters surrounded by dollar signs

Code: The lambda parameter, $\lambda$, represents the mean of the distribution.

Output: The lambda parameter, \(\lambda\), represents the mean of the distribution.
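These pieces (Greek letters, subscripts, superscripts, display equations) can be combined. As a worked example, assuming the distribution in question is the Poisson (the slide does not say which distribution is meant), its probability mass function could be typeset as:

```latex
For a Poisson variable with mean $\lambda$, the probability of observing $y$ events is
\[ P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!} \]
```

Here `\frac{}{}` builds the fraction, and the braces around `y` and `-\lambda` group everything that belongs in the superscript.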

General formatting commands

\newpage will force a page break in a paged document

Customize paged documents

fancyhdr is a LaTeX package for customizing page headers and footers

http://mirror.las.iastate.edu/tex-archive/macros/latex/contrib/fancyhdr/fancyhdr.pdf

From the syllabus YAML:

header-includes:
- \usepackage{fancyhdr}
- \pagestyle{fancy}
- \fancyhead[RO,RE]{Statistical Graphics}
- \fancyhead[LO,LE]{S. Coxe}
- \fancyfoot[LE,RO]{Fall 2019}
- \fancypagestyle{plain}{\pagestyle{fancy}}

Customize paged documents

\usepackage is the LaTeX equivalent to library()

E: Even page
O: Odd page
L: Left field
C: Center field
R: Right field
H: Header
F: Footer

Include an image in the upper right corner: \fancyhead[RO,RE]{\includegraphics[width=3cm]{picture.jpg}}

Customize presentations

Stylesheets (.sty) for Beamer presentations

From the presentation stylesheet YAML:

output:
  beamer_presentation:  # indent 1 level (2 spaces)
    theme: CoxeDiv5     # indent 2 levels (4 spaces)

Indentation matters for YAML

The .sty file is called beamerthemeCoxeDiv5.sty

  • The name in the theme: line is the part after “beamertheme”

Customize presentations

I usually don’t create a .sty file from scratch

  • There are a lot of things you can change that you probably don’t care about, so just find one and modify the parts you want to change

\definecolor{fiublack}{HTML}{181818}
\definecolor{fiuyellow}{HTML}{B6862C}
\definecolor{fiublue}{HTML}{081E3F}

  • Define some colors so you can use their nicknames instead of the hex code

Customize presentations

\setbeamercolor*{Title bar}{fg=fiublue}

  • The bar across the top of the slides is FIU blue

Lines 34 - 35:

\titlegraphic{\includegraphics[width=0.15\textwidth,%height=.5\textheight
]{fiulogo_square}}

  • Puts the FIU logo file on the title slide

Customize presentations

\setbeamertemplate{footline}
{
\linethickness{0.25pt}
\framelatex{
\begin{beamercolorbox}[leftskip=.3cm,sep=0.1cm]{Location bar}
\usebeamerfont{section in head/foot}
\insertshortauthor~|~\insertshorttitle
\hfill
\insertframenumber/\inserttotalframenumber
\end{beamercolorbox}}
}

  • Include author name, presentation title, current slide number, and total number of slides in the footer of each slide

Error bars

Adding error bars to plots

Requires some data manipulation

  • geom_errorbar requires minimum and maximum values for the error bars

  • These are typically based on the standard error (of a statistic or parameter) or the standard deviation (of raw data)

Use summarize (or a similar function) to get standard deviations / errors, then supply minimum and maximum values

  • For example, \(\pm\) 1 SD

Summarize – get the means and SDs by continent

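library(dplyr)      # filter, group_by, summarize, glimpse, %>%
library(gapminder)  # gapminder data set

# glimpse(gapminder)
# Mean and SD of life expectancy by continent in 2002
gapminder_summ <- gapminder %>%
  filter(year == 2002) %>%
  group_by(continent) %>%
  summarize(life_exp_mean = mean(lifeExp),
            life_exp_sd = sd(lifeExp)) %>%
  ungroup()
glimpse(gapminder_summ)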
## Rows: 5
## Columns: 3
## $ continent     <fct> Africa, Americas, Asia, Europe, Oceania
## $ life_exp_mean <dbl> 53.32523, 72.42204, 69.23388, 76.70060, 79.74000
## $ life_exp_sd   <dbl> 9.5864959, 4.7997055, 8.3745954, 2.9221796, 0.8909545
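The same approach works for standard errors. The sketch below computes the standard error of the mean and an approximate 95% interval (mean ± 1.96 SE). The 1.96 multiplier is a normal approximation and the names `gapminder_se`, `ci_lower`, and `ci_upper` are chosen for this example; with very small groups (such as Oceania, with two countries) the interval is only a rough guide.

```r
library(dplyr)
library(gapminder)

gapminder_se <- gapminder %>%
  filter(year == 2002) %>%
  group_by(continent) %>%
  summarize(life_exp_mean = mean(lifeExp),
            life_exp_se   = sd(lifeExp) / sqrt(n()),  # standard error of the mean
            .groups = "drop") %>%
  # approximate 95% limits, ready to use as ymin and ymax
  mutate(ci_lower = life_exp_mean - 1.96 * life_exp_se,
         ci_upper = life_exp_mean + 1.96 * life_exp_se)

glimpse(gapminder_se)
```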

Make the plot - Scatterplot

ggplot can calculate the actual ymin and ymax values for you

scatter_error <-
  ggplot(data = gapminder_summ,
         aes(x = continent, y = life_exp_mean)) +
  geom_point() +
  # error bars span the mean +/- 1 SD, computed inside aes()
  geom_errorbar(data = gapminder_summ,
                aes(x = continent,
                    ymin = life_exp_mean - life_exp_sd,
                    ymax = life_exp_mean + life_exp_sd))

Make the plot - Bar plot

bar_both <-
  ggplot(data = gapminder_summ,
         aes(x = continent, y = life_exp_mean)) +
  geom_col(fill = "blue", alpha = 0.2) +
  # error bars extend above and below the top of each bar
  geom_errorbar(data = gapminder_summ,
                aes(x = continent,
                    ymin = life_exp_mean - life_exp_sd,
                    ymax = life_exp_mean + life_exp_sd))

Make the plot - Bar plot (just show the upper error bar)

Just shows the upper limit

  • ymax = upper limit

  • ymin = point estimate

The color option gives the bars a black outline, so the lower cap of the error bar (which sits on the top edge of the bar) isn’t as obvious

bar_upper <-
  ggplot(data = gapminder_summ,
         aes(x = continent, y = life_exp_mean)) +
  geom_col(color = "black", fill = "blue", alpha = 0.2) +
  # ymin is the point estimate, so only the upper half of the bar shows
  geom_errorbar(data = gapminder_summ,
                aes(x = continent,
                    ymin = life_exp_mean,
                    ymax = life_exp_mean + life_exp_sd))

Make the plot - Scatterplot (with crossbar error bars)

Cross bar at point estimate

Box to ymin and ymax

crossbar <-
  ggplot(data = gapminder_summ,
         aes(x = continent, y = life_exp_mean)) +
  geom_point() +
  # box from ymin to ymax, with a cross bar at y (the mean)
  geom_crossbar(data = gapminder_summ,
                aes(x = continent,
                    ymin = life_exp_mean - life_exp_sd,
                    ymax = life_exp_mean + life_exp_sd))

Make the plot - Scatterplot (with line range error bars)

No caps on the ends of the error bars

linerange <-
  ggplot(data = gapminder_summ,
         aes(x = continent, y = life_exp_mean)) +
  geom_point() +
  # plain vertical line from ymin to ymax (no caps)
  geom_linerange(data = gapminder_summ,
                 aes(x = continent,
                     ymin = life_exp_mean - life_exp_sd,
                     ymax = life_exp_mean + life_exp_sd))

Make the plot - Scatterplot (with point range error bars)

No geom_point needed here

Add the point estimate in the aesthetic for the geom_pointrange function

pointrange <-
  ggplot(data = gapminder_summ,
         aes(x = continent, y = life_exp_mean)) +
  # y supplies the point; ymin and ymax supply the line
  geom_pointrange(data = gapminder_summ,
                  aes(x = continent, y = life_exp_mean,
                      ymin = life_exp_mean - life_exp_sd,
                      ymax = life_exp_mean + life_exp_sd))

Some last thoughts about error bars

I used all summarized data (mean and SD) in these plots

  • You can instead plot unsummarized data

  • Then layer error bars on that

Each geom can have its own dataset

  • Use observed dataset in the geom_point / geom_bar call

  • Use summarized dataset in the error bar call
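A sketch of that layering, assuming the same 2002 gapminder data as the earlier plots. The jitter width, transparency, and error bar color are arbitrary choices, and `geom_jitter` is used in place of `geom_point` so that overlapping countries stay visible.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Observed (unsummarized) data: one row per country
gapminder_2002 <- gapminder %>%
  filter(year == 2002)

# Summarized data: mean and SD per continent
gapminder_summ <- gapminder_2002 %>%
  group_by(continent) %>%
  summarize(life_exp_mean = mean(lifeExp),
            life_exp_sd = sd(lifeExp),
            .groups = "drop")

raw_with_error <-
  ggplot() +
  # observed dataset in the point layer
  geom_jitter(data = gapminder_2002,
              aes(x = continent, y = lifeExp),
              width = 0.15, alpha = 0.4) +
  # summarized dataset in the error bar layer (mean +/- 1 SD)
  geom_errorbar(data = gapminder_summ,
                aes(x = continent,
                    ymin = life_exp_mean - life_exp_sd,
                    ymax = life_exp_mean + life_exp_sd),
                width = 0.3, color = "red")

raw_with_error
```

Because no data or aesthetics are set in the `ggplot()` call, each layer has to name its own dataset and mappings.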