BTS 510 Lab 2

set.seed(12345)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Note
  • I’m loading all the packages at the top here
    • library(tidyverse) loads the core tidyverse packages, which include dplyr and tidyr as well as ggplot2 and several others (see the list in the startup message above)

1 Learning objectives

  • Read data into R from external formats (e.g., Excel, .csv, SPSS)
  • Use dplyr functions in R to manipulate datasets (e.g., create new variables)
  • Use tidyr functions in R to organize and re-structure datasets

2 Read in data

  • Read in the “gapminder.csv” file
    • You’ll need to either
      • Save it in the same folder as your .qmd file OR
      • Supply the entire path to the file
    • I have it saved in the same folder, so I’ll just give the file name
gap <- read.csv("gapminder.csv")
  • Because I already looked at the data, I know that read.csv() read in a couple of variables differently from how the original data frame was set up
    • Specifically, the country and continent variables were originally “factor” variables but were read in here as “character” variables
    • So I’m going to convert them to factor variables using the as.factor() function
    • Among other things, this makes some of the “data view” functions (e.g., str()) show the number of levels of each variable
gap$country <- as.factor(gap$country)
gap$continent <- as.factor(gap$continent)
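  • As an alternative to converting columns one at a time, read.csv() has a stringsAsFactors argument. The sketch below uses a tiny made-up CSV written to a temporary file (the file and the toy data frame are assumptions for illustration, not the lab's gapminder.csv). Note that this option converts every character column, which may not always be what you want.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,continent,year,lifeExp",
             "Afghanistan,Asia,1952,28.801",
             "Albania,Europe,1952,55.230"), tmp)

# stringsAsFactors = TRUE turns every character column into a factor on read
toy <- read.csv(tmp, stringsAsFactors = TRUE)

# Check the column types
str(toy)
```

  • Converting with as.factor() after reading, as above, gives you column-by-column control; stringsAsFactors = TRUE is a shortcut when all text columns should be factors.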

2.1 Check out the data

  • head() function
head(gap)
      country continent year lifeExp      pop gdpPercap
1 Afghanistan      Asia 1952  28.801  8425333  779.4453
2 Afghanistan      Asia 1957  30.332  9240934  820.8530
3 Afghanistan      Asia 1962  31.997 10267083  853.1007
4 Afghanistan      Asia 1967  34.020 11537966  836.1971
5 Afghanistan      Asia 1972  36.088 13079460  739.9811
6 Afghanistan      Asia 1977  38.438 14880372  786.1134
  • str() function
str(gap)
'data.frame':   1704 obs. of  6 variables:
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num  779 821 853 836 740 ...
  • glimpse() function
glimpse(gap)
Rows: 1,704
Columns: 6
$ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
  • The data look like they were read in correctly
    • How many countries?
    • How many years?
    • How many variables (besides country, continent, and year)?
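  • Questions like these can be answered by counting distinct values. Here is a minimal sketch on a made-up data frame (toy and its values are hypothetical, not the gapminder data): nlevels() counts the levels of a factor, and dplyr's n_distinct() counts unique values in any vector.

```r
library(dplyr)

# Made-up data: 3 countries measured in 2 years
toy <- data.frame(
  country = factor(c("A", "A", "B", "B", "C", "C")),
  year    = c(1952L, 1957L, 1952L, 1957L, 1952L, 1957L),
  lifeExp = c(30.1, 31.2, 55.0, 56.3, 62.4, 63.0)
)

n_countries <- nlevels(toy$country)  # number of factor levels
n_years     <- n_distinct(toy$year)  # number of unique values

n_countries
n_years
```

  • One caution: nlevels() counts all levels of a factor, including levels that no longer appear in the data after filtering, whereas n_distinct() counts only values that are actually present.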

3 Convert from tall to wide and back successfully

  • Part of the reason we couldn’t easily go back and forth in the lecture is that we were converting multiple variables at once
    • This made it too complicated
      • (Although, of course, that’s what real data is like…)
    • So let’s make this a little simpler and use only 1 variable: lifeExp

3.1 Step 1: select() only some variables (lifeExp)

gap1 <- gap %>% select(country, continent, year, lifeExp)
head(gap1)
      country continent year lifeExp
1 Afghanistan      Asia 1952  28.801
2 Afghanistan      Asia 1957  30.332
3 Afghanistan      Asia 1962  31.997
4 Afghanistan      Asia 1967  34.020
5 Afghanistan      Asia 1972  36.088
6 Afghanistan      Asia 1977  38.438

3.2 Step 2: pivot_wider() to convert to a wide dataset

gap1_wide <- gap1 %>% pivot_wider(names_from = year, 
                                  values_from = lifeExp)
head(gap1_wide)
# A tibble: 6 × 14
  country     continent `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987`
  <fct>       <fct>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 Afghanistan Asia        28.8   30.3   32.0   34.0   36.1   38.4   39.9   40.8
2 Albania     Europe      55.2   59.3   64.8   66.2   67.7   68.9   70.4   72  
3 Algeria     Africa      43.1   45.7   48.3   51.4   54.5   58.0   61.4   65.8
4 Angola      Africa      30.0   32.0   34     36.0   37.9   39.5   39.9   39.9
5 Argentina   Americas    62.5   64.4   65.1   65.6   67.1   68.5   69.9   70.8
6 Australia   Oceania     69.1   70.3   70.9   71.1   71.9   73.5   74.7   76.3
# ℹ 4 more variables: `1992` <dbl>, `1997` <dbl>, `2002` <dbl>, `2007` <dbl>
  • How many years of data are in this dataset?

3.3 Step 3: pivot_longer() to convert back to a tall dataset

gap1_tall <- gap1_wide %>% pivot_longer(cols = 3:14, 
                                        names_to = "year", 
                                        values_to = "lifeExp")
head(gap1_tall)
# A tibble: 6 × 4
  country     continent year  lifeExp
  <fct>       <fct>     <chr>   <dbl>
1 Afghanistan Asia      1952     28.8
2 Afghanistan Asia      1957     30.3
3 Afghanistan Asia      1962     32.0
4 Afghanistan Asia      1967     34.0
5 Afghanistan Asia      1972     36.1
6 Afghanistan Asia      1977     38.4
  • Beautiful!
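    • Though notice that year is now a character variable, because pivot_longer() builds it from the column names, which are always character strings
    • If you wanted to, you could change it back with the as.numeric() function (or as.integer()), just as we used as.factor() above to convert variables to factors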
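  • A minimal sketch of two ways to get a numeric year back, using a small made-up wide tibble (toy_wide and its values are hypothetical). The first converts after pivoting with mutate(); the second uses the names_transform argument of pivot_longer(), which assumes a reasonably recent version of tidyr.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one column per year
toy_wide <- tibble(
  country = c("A", "B"),
  `1952`  = c(30.1, 55.0),
  `1957`  = c(31.2, 56.3)
)

# Option 1: pivot first, then convert the character year to integer
toy_tall1 <- toy_wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp") %>%
  mutate(year = as.integer(year))

# Option 2: convert while pivoting
toy_tall2 <- toy_wide %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "lifeExp",
               names_transform = list(year = as.integer))

str(toy_tall1)
str(toy_tall2)
```

  • Both versions use cols = -country (everything except country) rather than column positions, so the call keeps working if the number of year columns changes.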

4 Manipulate some data

  • For each task, start with the original data, gap, and use appropriate functions (e.g., print(), head(), str(), glimpse()) to check that you have the data you were trying to get.
  1. Keep only observations from countries in Asia and Europe.

  2. Create a new variable that is the ratio of life expectancy to GDP per capita.

  3. Create a new data frame with the mean GDP per capita for each continent.

  4. Find the country in the Americas with the largest population. What country was it and what year was it?

  5. Keep only data from Asia prior to 1975 and sort it according to population.

  6. How many countries have a life expectancy greater than 80 in 2007?
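  • As a reminder of the dplyr verbs these tasks draw on, here is a sketch on a small made-up tibble (toy, group, x, and y are hypothetical names, and none of this is the gapminder data or a solution to the tasks above). You will need to adapt the verbs and conditions to each task.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  group = c("a", "a", "b", "b", "c"),
  x     = c(1, 4, 2, 8, 5),
  y     = c(10, 20, 10, 40, 25)
)

# filter() keeps rows that match a condition; %in% tests membership in a set
toy_ab <- toy %>% filter(group %in% c("a", "b"))

# mutate() adds a column computed from existing columns
toy_ratio <- toy %>% mutate(ratio = y / x)

# group_by() + summarize() collapses the data to one row per group
toy_means <- toy %>% group_by(group) %>% summarize(mean_x = mean(x))

# arrange() sorts rows; desc() puts the largest values first
toy_sorted <- toy %>% arrange(desc(x))

# slice_max() keeps the row(s) with the largest value of a column
toy_top <- toy %>% slice_max(x, n = 1)
```

  • Each result is saved under a new name so the original toy data stay unchanged, which mirrors the instruction to start every task from the original gap data.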