read.csv()
(built-in)read_excel()
function from readxl packageNote
gap
– much shorter to type.C:\Users\stefany\Desktop
(Windows) or Users/stefany/Desktop
(Mac)\
s can cause problems because they’re “escape” characters. If you get an error about that, just switch the \
s to /
s country continent year lifeExp pop gdpPercap
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
Note
read_sav
function from the haven packagegap_spss1
so it wouldn’t overwrite the CSV version that was just called gap
.# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl>
1 1 [Afghanistan] 3 [Asia] 1952 28.8 8425333 779.
2 1 [Afghanistan] 3 [Asia] 1957 30.3 9240934 821.
3 1 [Afghanistan] 3 [Asia] 1962 32.0 10267083 853.
4 1 [Afghanistan] 3 [Asia] 1967 34.0 11537966 836.
5 1 [Afghanistan] 3 [Asia] 1972 36.1 13079460 740.
6 1 [Afghanistan] 3 [Asia] 1977 38.4 14880372 786.
Note
country
and continent
variables are type dbl+lbl
: dbl
is the number, lbl
is the actual name label.Note
read.spss
function from the foreign packagegap_spss2
so it wouldn’t overwrite the CSV version that was just called gap
.to.data.frame = TRUE
forces the file into a data frame (in case it wasn’t already)use.value.labels = TRUE
uses the variable labels from SPSS (ignoring the numbers) country continent year lifeExp pop gdpPercap
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
Note
country
and continent
are now the labels, not the numbers.head()
, glimpse()
, str()
, etc.gapminder
dataset country continent year lifeExp pop gdpPercap
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
7 Afghanistan Asia 1982 39.854 12881816 978.0114
8 Afghanistan Asia 1987 40.822 13867957 852.3959
9 Afghanistan Asia 1992 41.674 16317921 649.3414
10 Afghanistan Asia 1997 41.763 22227415 635.3414
11 Afghanistan Asia 2002 42.129 25268405 726.7341
12 Afghanistan Asia 2007 43.828 31889923 974.5803
13 Albania Europe 1952 55.230 1282697 1601.0561
14 Albania Europe 1957 59.280 1476505 1942.2842
15 Albania Europe 1962 64.820 1728137 2312.8890
pivot_wider()
function
table2
is the dataset you start from (in this case, the tall one)names_from
is the variable that defines the repeated measurevalues_from
is the variable you are interested in “widening”
gapminder
dataset country continent year lifeExp pop gdpPercap
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
7 Afghanistan Asia 1982 39.854 12881816 978.0114
8 Afghanistan Asia 1987 40.822 13867957 852.3959
9 Afghanistan Asia 1992 41.674 16317921 649.3414
10 Afghanistan Asia 1997 41.763 22227415 635.3414
11 Afghanistan Asia 2002 42.129 25268405 726.7341
12 Afghanistan Asia 2007 43.828 31889923 974.5803
13 Albania Europe 1952 55.230 1282697 1601.0561
14 Albania Europe 1957 59.280 1476505 1942.2842
15 Albania Europe 1962 64.820 1728137 2312.8890
year
defines the repeated measurelifeExp
, pop
, and gdpPercap
into one column per year
pivot_wider()
: Tall to widegap_wide
<-
) the data frame called gap
%>%
) gap
into the pivot_wider()
functionpivot_wider()
has 2 arguments (plus the data)
names_from = year
values_from = c(lifeExp, pop, gdpPercap)
pivot_wider()
: Now wide# A tibble: 6 × 38
country continent lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 28.8 30.3 32.0 34.0
2 Albania Europe 55.2 59.3 64.8 66.2
3 Algeria Africa 43.1 45.7 48.3 51.4
4 Angola Africa 30.0 32.0 34 36.0
5 Argentina Americas 62.5 64.4 65.1 65.6
6 Australia Oceania 69.1 70.3 70.9 71.1
# ℹ 32 more variables: lifeExp_1972 <dbl>, lifeExp_1977 <dbl>,
# lifeExp_1982 <dbl>, lifeExp_1987 <dbl>, lifeExp_1992 <dbl>,
# lifeExp_1997 <dbl>, lifeExp_2002 <dbl>, lifeExp_2007 <dbl>, pop_1952 <int>,
# pop_1957 <int>, pop_1962 <int>, pop_1967 <int>, pop_1972 <int>,
# pop_1977 <int>, pop_1982 <int>, pop_1987 <int>, pop_1992 <int>,
# pop_1997 <int>, pop_2002 <int>, pop_2007 <int>, gdpPercap_1952 <dbl>,
# gdpPercap_1957 <dbl>, gdpPercap_1962 <dbl>, gdpPercap_1967 <dbl>, …
pivot_longer()
: Wide to tallNote
pivot_longer()
: Now tall(ish)# A tibble: 5,112 × 4
country continent var_year value
<chr> <chr> <chr> <dbl>
1 Afghanistan Asia lifeExp_1952 28.8
2 Afghanistan Asia lifeExp_1957 30.3
3 Afghanistan Asia lifeExp_1962 32.0
4 Afghanistan Asia lifeExp_1967 34.0
5 Afghanistan Asia lifeExp_1972 36.1
6 Afghanistan Asia lifeExp_1977 38.4
7 Afghanistan Asia lifeExp_1982 39.9
8 Afghanistan Asia lifeExp_1987 40.8
9 Afghanistan Asia lifeExp_1992 41.7
10 Afghanistan Asia lifeExp_1997 41.8
11 Afghanistan Asia lifeExp_2002 42.1
12 Afghanistan Asia lifeExp_2007 43.8
13 Afghanistan Asia pop_1952 8425333
14 Afghanistan Asia pop_1957 9240934
15 Afghanistan Asia pop_1962 10267083
16 Afghanistan Asia pop_1967 11537966
17 Afghanistan Asia pop_1972 13079460
18 Afghanistan Asia pop_1977 14880372
19 Afghanistan Asia pop_1982 12881816
20 Afghanistan Asia pop_1987 13867957
21 Afghanistan Asia pop_1992 16317921
22 Afghanistan Asia pop_1997 22227415
23 Afghanistan Asia pop_2002 25268405
24 Afghanistan Asia pop_2007 31889923
25 Afghanistan Asia gdpPercap_1952 779.
26 Afghanistan Asia gdpPercap_1957 821.
27 Afghanistan Asia gdpPercap_1962 853.
28 Afghanistan Asia gdpPercap_1967 836.
29 Afghanistan Asia gdpPercap_1972 740.
30 Afghanistan Asia gdpPercap_1977 786.
# ℹ 5,082 more rows
Note
pivot_wider()
and pivot_longer()
are a little bespoke to your particular dataset
names_from
to create the new variables, but there are a lot of other options, depending on how your dataset is set uppivot_wider()
or pivot_longer()
successfully on the first tryvar_year
variable has 2 pieces of information:
lifeExp
, pop
, gpdPercap
year
variable: 1952, 1957, etc.separate()
: multiple columnsseparate_longer_delim()
: multiple rowsseparate()
separate_wider_delim()
var_year
is the variable we want to splitsep
is the separation character – here, underscore (_)into
are the names of the new variables: variable
and year
separate()
# A tibble: 5,112 × 5
country continent variable year value
<chr> <chr> <chr> <chr> <dbl>
1 Afghanistan Asia lifeExp 1952 28.8
2 Afghanistan Asia lifeExp 1957 30.3
3 Afghanistan Asia lifeExp 1962 32.0
4 Afghanistan Asia lifeExp 1967 34.0
5 Afghanistan Asia lifeExp 1972 36.1
6 Afghanistan Asia lifeExp 1977 38.4
7 Afghanistan Asia lifeExp 1982 39.9
8 Afghanistan Asia lifeExp 1987 40.8
9 Afghanistan Asia lifeExp 1992 41.7
10 Afghanistan Asia lifeExp 1997 41.8
# ℹ 5,102 more rows
unite()
century
: 19 or 20year
: 00 to 99year
variable: 1998, 1999, 2000, 2001unite()
site
: 1, 2, 3, 4ID
: 001, 002, 003, etc.newID
variable: 1001, 1002, 1003, 2001, 2002Note
The tidyr functions for missing values might be useful for something. Dropping missing values or replacing them with a specific value is NOT a good choice in experimental data. We don’t have time to go in much depth with this but check the References for more info on handling missing data.
NA
.
NA
NA
NA
mutate()
: Create new variablesgdp
variable from gdpPercap
and pop
mutate()
statementmutate()
: Create new variables country continent year lifeExp pop gdpPercap gdp
1 Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330
2 Afghanistan Asia 1957 30.332 9240934 820.8530 7585448670
3 Afghanistan Asia 1962 31.997 10267083 853.1007 8758855797
4 Afghanistan Asia 1967 34.020 11537966 836.1971 9648014150
5 Afghanistan Asia 1972 36.088 13079460 739.9811 9678553274
6 Afghanistan Asia 1977 38.438 14880372 786.1134 11697659231
select()
: Select columns (variables)country
, continent
, year
, and lifeExp
c()
functionstarts_with()
, ends_with()
, contains()
select()
: Select columnsfilter()
: Select rows&
, any criteria with |
, not this criteria with !
<
, >
, <=
, >=
, etc.filter()
: Select rows country continent year lifeExp pop gdpPercap
1 Afghanistan Asia 1952 28.801 8425333 779.4453
2 Afghanistan Asia 1957 30.332 9240934 820.8530
3 Afghanistan Asia 1962 31.997 10267083 853.1007
4 Afghanistan Asia 1967 34.020 11537966 836.1971
5 Afghanistan Asia 1972 36.088 13079460 739.9811
6 Afghanistan Asia 1977 38.438 14880372 786.1134
7 Afghanistan Asia 1982 39.854 12881816 978.0114
8 Afghanistan Asia 1987 40.822 13867957 852.3959
9 Afghanistan Asia 1992 41.674 16317921 649.3414
10 Afghanistan Asia 1997 41.763 22227415 635.3414
11 Afghanistan Asia 2002 42.129 25268405 726.7341
12 Afghanistan Asia 2007 43.828 31889923 974.5803
13 Bahrain Asia 1952 50.939 120447 9867.0848
14 Bahrain Asia 1957 53.832 138655 11635.7995
15 Bahrain Asia 1962 56.923 171863 12753.2751
summarise()
: Summarize datamean()
, sd()
, n()
summarise()
: Summarize datalifeExp
arrange()
: Change row orderdesc()
arrange()
: Change row order country continent year lifeExp pop gdpPercap
1 Rwanda Africa 1992 23.599 7290203 737.0686
2 Afghanistan Asia 1952 28.801 8425333 779.4453
3 Gambia Africa 1952 30.000 284320 485.2307
4 Angola Africa 1952 30.015 4232095 3520.6103
5 Sierra Leone Africa 1952 30.331 2143249 879.7877
6 Afghanistan Asia 1957 30.332 9240934 820.8530
group_by()
: Separately by groupNote
ungroup()
after you do whatever you need the group_by()
forgroup_by()
: Separately by group