Exploratory data analysis

What is exploratory data analysis?

Tukey advocated for and developed EDA as an adjunct to traditional statistical analyses (not a replacement)

Summarize variables in a data set

Often use visual methods

May or may not involve a statistical model

Used to formulate hypotheses that could lead to new data collection and experiments

We want to be surprised

Includes preliminary data examination prior to analysis

Look at distributions, nonlinear relationships, etc.

How does EDA differ from what we normally do?

It doesn’t really

We do (or should do) exploratory data analysis in the service of our analyses all the time

Examining variables for normality
Transforming variables to be more normal
Examining spaghetti plots before conducting mixed models
Exploring non-linear (e.g., quadratic) relationships

There is a chapter on EDA in the “R for Data Science” book by Wickham: https://r4ds.had.co.nz/exploratory-data-analysis.html

Objectives of EDA

The objectives of EDA are to:

Suggest hypotheses about the causes of observed phenomena
Assess assumptions on which statistical inference will be based
Support the selection of appropriate statistical tools and techniques
Provide a basis for further data collection through surveys or experiments

From https://en.wikipedia.org/wiki/Exploratory_data_analysis, attributed to Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2(2), 131 - 160.

Exploratory versus confirmatory

Exploratory analysis is ok

What’s not ok is to do CDA on the same data you did EDA on
With a few exceptions, like assessing assumptions
Don’t test a hypothesis on the same data you derived it from

Options for EDA followed by CDA:

Use two different samples
Split your sample in half: EDA on one half, followed by CDA on other half
Other splits for cross-validation: k-fold, Monte Carlo subsample, leave 1 out, leave p out

Obviously, you need a fairly large sample to be able to split it

Exploration tools

We already have tools to do EDA

Plots, glimpse / head / etc., summary statistics

There are several packages that aim to assist in EDA, or even semi-automate it

skimr

skim function

Use on data frames, grouped data frames, vectors, matrices
Returns descriptive statistics on all variables, grouped by variable type
Output is all text
Can “pipe” the output to other things (e.g., pipe means to a plot)

visdat

vis_dat function

Use on a data frame or variable. Returns a plot of value types (including missing)

vis_miss function

Use on a data frame or variable. Returns a plot of missing versus present

vis_cor function

Use on a data frame. Returns a graphical correlation matrix

DataExplorer

plot_missing function

Graphical percent missing for all variables in a data frame

plot_bar function

Values for categorical variables

plot_histogram function

Values for continuous variables

plot_qq, plot_correlation, plot_boxplot, plot_scatterplot functions

Do what it sounds like they do

inspectdf

inspect_types

Plot of variable types in a data frame

inspect_na

Plot of missing value percentages for each variable

inspect_cor

Cool plot of all correlations between pairs of variables

inspect_imb

Graph of most common response for categorical variables

inspectdf

inspect_num

Distributions of numeric variables

inspect_cat

Distributions of categorical variables

inspect_mem

Information on memory usage of different variables

Customizing documents

Templates

Document templates that change appearance for you automatically

papaja: https://github.com/crsh/papaja

tufte: https://rstudio.github.io/tufte/

rticles: https://bookdown.org/yihui/rmarkdown/journals.html

Many others: http://jianghao.wang/post/2017-12-08-rmarkdown-templates/

Once you’ve installed the package, you can choose the template from File -> New File -> R Markdown -> From Template

papaja package has an APA (v6) article template

# Install devtools package if necessary

if(!"devtools" %in% rownames(installed.packages())) install.packages("devtools")

# Install the stable development verions from GitHub

devtools::install_github("crsh/papaja")

Using commands from

A document preparation and typesetting system

https://www.latex-project.org/

Similar to Markdown, in that you code your text as you write it, then compile to get an output document

Markdown is not the same as LaTeX, but you can use some LaTeX commands to customize your markdown documents

The reproducible presentation I showed you used a LaTeX stylesheet to change the appearance of the presentation (a LaTeX beamer presentation)

Using commands from

Inline equations surrounded by dollar signs

(Subscript with underscore, superscript with caret)

Code: For this model, the $\it R_{mult}^2 = 0.42$

Output: For this model, the $\it R_{mult}^2 = 0.42$

Centered, non-inline equations between \[ and \]

(Math-specific characters like hat)

Code: \[\hat Y = b_0 + b_1 X_1 + b_2 X_2 \]

Output: \[\hat Y = b_0 + b_1 X_1 + b_2 X_2 \]

Using commands from

Greek letters surrounded by dollar signs

Code: The lamba parameter, $\lambda$, represents the mean of the distribution.

Output: The lamba parameter, $\lambda$, represents the mean of the distribution.

General formatting commands

\newpage will force a page break in a paged document

Customize paged documents

fancyhdr is a package that replicates some LaTeX functionality

http://mirror.las.iastate.edu/tex-archive/macros/latex/contrib/fancyhdr/fancyhdr.pdf

From the syllabus YAML:

header-includes:
- \usepackage{fancyhdr}
- \pagestyle{fancy}
- \fancyhead[RO,RE]{Statistical Graphics}
- \fancyhead[LO,LE]{S. Coxe}
- \fancyfoot[LE,RO]{Fall 2019}
- \fancypagestyle{plain}{\pagestyle{fancy}}

Customize paged documents

is the LaTeX equivalent to library()

E: Even page
O: Odd page
L: Left field
C: Center field
R: Right field
H: Header
F: Footer

Include an image in the upper right corner: \fancyhead[RO,RE]{\includegraphics[width=3cm]{picture.jpg}}

Customize presentations

Stylesheets (.sty) for Beamer presentations

From the presentation stylesheet YAML:

output:
beamer_presentation: # indent 1 tab
theme: CoxeDiv5 # indent 2 tabs

Indentation matters for YAML

The .sty file is called beamerthemeCoxeDiv5.sty

The name in the theme: line is the part after “beamertheme”

Customize presentations

I usually don’t create a .sty file from scratch

There are a lot of things you can change that you probably don’t care about, so just find one and modify the parts you want to change

\definecolor{fiublack}{HTML}{181818} \definecolor{fiuyellow}{HTML}{B6862C} \definecolor{fiublue}{HTML}{081E3F}

Define some colors so you can use their nicknames instead of the hex code

Customize presentations

\setbeamercolor*{Title bar}{fg=fiublue}

The bar across the top of the slides is FIU blue

Line 34 - 35: \titlegraphic{\includegraphics [width=0.15\textwidth,%height=.5\textheight ]{fiulogo_square}}

Puts the FIU logo file on the title slide

Customize presentations

\setbeamertemplate{footline}
{
\linethickness{0.25pt}
\framelatex{
\begin{beamercolorbox}[leftskip=.3cm,sep=0.1cm]{Location bar}
\usebeamerfont{section in head/foot}
\insertshortauthor~|~\insertshorttitle
\hfill
\insertframenumber/\inserttotalframenumber
\end{beamercolorbox}}
}

Include author name, presentation title, current slide number, and total number of slides in the footer of each slide

Error bars

Adding error bars to plots

Requires some data manipulation

geom_errorbar requires minimum and maximum values for the error bars
These are typically based on the standard error (of a statistic or parameter) or the standard deviation (of raw data)

Use summarize (or a similar function) to get standard deviations / errors, then supply minimum and maximum values

For example, $\pm$ 1 SD

Summarize – get the means and SDs by continent

#glimpse(gapminder)
gapminder_summ <- gapminder %>%  
  filter(year == 2002) %>%
  group_by(continent) %>%  
  summarize(life_exp_mean = mean(lifeExp), 
              life_exp_sd = sd(lifeExp)) %>%
  ungroup()
glimpse(gapminder_summ)

## Rows: 5
## Columns: 3
## $ continent     <fct> Africa, Americas, Asia, Europe, Oceania
## $ life_exp_mean <dbl> 53.32523, 72.42204, 69.23388, 76.70060, 79.74000
## $ life_exp_sd   <dbl> 9.5864959, 4.7997055, 8.3745954, 2.9221796, 0.8909545

Make the plot - Scatterplot

ggplot can calculate the actual ymin and ymax values for you

scatter_error <- 
  ggplot(data = gapminder_summ, 
    aes(x = continent, y = life_exp_mean)) +
  geom_point() +
  geom_errorbar(data = gapminder_summ, 
                  aes(x = continent, 
                  ymin = life_exp_mean - life_exp_sd, 
                  ymax = life_exp_mean + life_exp_sd))

Make the plot - Scatterplot

Make the plot - Bar plot

bar_both <- 
  ggplot(data = gapminder_summ,
    aes(x = continent, y = life_exp_mean)) +
  geom_col(fill = "blue", alpha = 0.2) +
  geom_errorbar(data = gapminder_summ, 
                  aes(x = continent, 
                  ymin = life_exp_mean - life_exp_sd, 
                  ymax = life_exp_mean + life_exp_sd))

Make the plot - Bar plot

Make the plot - Bar plot (just show the upper error bar)

Just shows the upper limit

ymax = upper limit
ymin = point estimate

Color option makes outline black so lower limit isn’t as obvious

bar_upper <- 
  ggplot(data = gapminder_summ, 
         aes(x = continent, y = life_exp_mean)) +
  geom_col(color = "black", fill = "blue", alpha = 0.2) +
  geom_errorbar(data = gapminder_summ, 
                  aes(x = continent, 
                  ymin = life_exp_mean, 
                  ymax = life_exp_mean + life_exp_sd))

Make the plot - Bar plot (just show the upper error bar)

Make the plot - Scatterplot (with crossbar error bars)

Cross bar at point estimate

Box to ymin and ymax

crossbar <- 
  ggplot(data = gapminder_summ, 
         aes(x = continent, y = life_exp_mean)) +
  geom_point() +
  geom_crossbar(data = gapminder_summ, 
                  aes(x = continent, 
                  ymin = life_exp_mean - life_exp_sd, 
                  ymax = life_exp_mean + life_exp_sd))

Make the plot - Scatterplot (with crossbar error bars)

Make the plot - Scatterplot (with line range error bars)

No caps on the ends of the error bars

linerange <- 
  ggplot(data = gapminder_summ, 
         aes(x = continent, y = life_exp_mean)) +
  geom_point() +
  geom_linerange(data = gapminder_summ, 
                  aes(x = continent, 
                  ymin = life_exp_mean - life_exp_sd, 
                  ymax = life_exp_mean + life_exp_sd))

Make the plot - Scatterplot (with line range error bars)

Make the plot - Scatterplot (with point range error bars)

No geom_point needed here

Add the point estimate in the aesthetic for the geom_pointrange function

pointrange <- 
  ggplot(data = gapminder_summ, 
         aes(x = continent, y = life_exp_mean)) +
  geom_pointrange(data = gapminder_summ, 
                  aes(x = continent, y = life_exp_mean,
                  ymin = life_exp_mean - life_exp_sd, 
                  ymax = life_exp_mean + life_exp_sd))

Make the plot - Scatterplot (with point range error bars)

Some last thoughts about error bars

I used all summarized data (mean and SD) in these plots

You can instead plot unsummarized data
Then layer error bars on that

Each geom can have it’s own dataset

Use observed dataset in the geom_point / geom_bar call
Use summarized dataset in the error bar call

Statistical Graphics and Communication

Stefany Coxe

October 3, 2019

Exploratory data analysis

What is exploratory data analysis?

How does EDA differ from what we normally do?

Objectives of EDA

Exploratory versus confirmatory

Exploration tools

skimr

visdat

DataExplorer

inspectdf

inspectdf

Customizing documents

Templates

papaja package has an APA (v6) article template

Using commands from

Using commands from

Using commands from

Customize paged documents

Customize paged documents

Customize presentations

Customize presentations

Customize presentations

Customize presentations

Error bars

Adding error bars to plots

Summarize – get the means and SDs by continent

Make the plot - Scatterplot

Make the plot - Scatterplot

Make the plot - Bar plot

Make the plot - Bar plot

Make the plot - Bar plot (just show the upper error bar)

Make the plot - Bar plot (just show the upper error bar)

Make the plot - Scatterplot (with crossbar error bars)

Make the plot - Scatterplot (with crossbar error bars)

Make the plot - Scatterplot (with line range error bars)

Make the plot - Scatterplot (with line range error bars)

Make the plot - Scatterplot (with point range error bars)

Make the plot - Scatterplot (with point range error bars)

Some last thoughts about error bars