Abstract
Part of theR for Artists and Designers
course at the
School of Foundation Studies, Srishti Manipal Institute of Art, Design,
and Technology, Bangalore.
This RMarkdown document is part of the Generic Skills Component (GSK) of the Course of the Foundation Studies Programme at Srishti Manipal Institute of Art, Design, and Technology, Bangalore India. The material is based on A Layered Grammar of Graphics by Hadley Wickham. The course is meant for First Year students pursuing a Degree in Art and Design.
The intent of this GSK part is to build Skill in coding in R, and also appreciate R as a way to metaphorically visualize information of various kinds, using predominantly geometric figures and structures.
All RMarkdown files combine code, text, web-images, and figures developed using code. Everything is text; code chunks are enclosed in fences (```)
dplyr
VERB functions to get to know and
manipulate a datasetThe method followed will be based on PRIMM:
parameters
of the code
do and write comments to explain. What bells and
whistles can you see?parameters
code provided to
understand the options
available. Write
comments to show what you have aimed for and achieved.We’ll use data from the Star Wars series of movies. May the Force be with you!!
When working with data you must:
Figure out what you want to do.
Describe those tasks in the form of a computer program.
Execute the program.
The dplyr
package makes these steps fast and easy:
By constraining your options, it helps you think about your data manipulation challenges.
It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.
This document introduces you to dplyr’s basic set of tools, and shows
you how to apply them to data frames. dplyr
also supports
databases via the dbplyr
package, once you’ve installed,
read vignette("dbplyr")
to learn more.
“Tidy Data” is an important way of thinking about what data typically
look like in R. Let’s fetch a figure from the web to show the
(preferred) structure of data in R. (The syntax to bring in a web-figure
is ![caption](url)
)
The three features described in the figure above define the nature of tidy data:
Data are imagined to be resulting from an experiment. Each variable represents a parameter/aspect in the experiment. Each row represents an additional datum of measurement. A cell is a single measurement on a single parameter(column) in a single observation(row).
To explore the basic data manipulation verbs of dplyr
,
we’ll use the dataset starwars
. This dataset contains 87
characters and comes from the Star Wars
API, and is documented in ?starwars
This means: type ?starwars
in the Console. Try.
dim(starwars)
#> [1] 87 14
starwars
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehic…¹ stars…²
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
#> 1 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human <chr> <chr> <chr>
#> 2 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr> <chr> <chr>
#> 3 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr> <chr> <chr>
#> 4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr> <chr> <chr>
#> # … with 83 more rows, and abbreviated variable names ¹vehicles, ²starships
Note that starwars
is a tibble
, a modern
reimagining of the data frame. It’s particularly useful for large
datasets because it only prints the first few rows. You can learn more
about tibbles at https://tibble.tidyverse.org; in particular you can
convert data frames to tibbles with as_tibble()
.
Check your Environment Tab to inspect starwars
in a
separate tab.
dplyr
aims to provide a function for each basic verb of
data manipulation. These verbs can be organised into three categories
based on the component of the dataset that they work with:
filter()
chooses rows based on column values.slice()
chooses rows based on location.arrange()
changes the order of the rows.select()
changes whether or not a column is
included.rename()
changes the name of columns.mutate()
changes the values of columns and creates new
columns.relocate()
changes the order of the columns.summarise()
collapses a group into a single row.Think of the parallels from Microsoft Excel.
All of the dplyr
functions take a data frame (or
tibble
) as the first argument. Rather than forcing the user
to either save intermediate objects or nest functions, dplyr provides
the %>%
operator from magrittr
.
x %>% f(y)
turns into f(x, y)
so the result
from one step is then “piped” into the next step. You can use the pipe
to rewrite multiple operations that you can read left-to-right,
top-to-bottom (reading the pipe operator as
“then”).
filter()
filter()
allows you to select a subset of rows in a data
frame. Like all single verbs, the first argument is the tibble (or data
frame). The second and subsequent arguments refer to variables within
that data frame, selecting rows where the expression is
TRUE
.
For example, we can select all character with light skin color and brown eyes with:
Note the double equal to sign (==) below! Equivalent to MS Excel Data -> Filter
starwars %>% filter(skin_color == "light", eye_color == "brown")
#> # A tibble: 7 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender homew…¹ species films vehic…² stars…³
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
#> 1 Leia Organa 150 49 brown light brown 19 female feminine Aldera… Human <chr> <chr> <chr>
#> 2 Biggs Darklighter 183 84 black light brown 24 male masculine Tatooi… Human <chr> <chr> <chr>
#> 3 Cordé 157 NA brown light brown NA female feminine Naboo Human <chr> <chr> <chr>
#> 4 Dormé 165 NA brown light brown NA female feminine Naboo Human <chr> <chr> <chr>
#> # … with 3 more rows, and abbreviated variable names ¹homeworld, ²vehicles, ³starships
arrange()
arrange()
works similarly to filter()
except that instead of filtering or selecting rows, it
reorders them. It takes a data frame, and a set of
column names (or more complicated expressions) to order by. If you
provide more than one column name, each additional column will be used
to break ties in the values of preceding columns:
starwars %>% arrange(height, mass)
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender homew…¹ species films vehic…² stars…³
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
#> 1 Yoda 66 17 white green brown 896 male mascu… <NA> Yoda's… <chr> <chr> <chr>
#> 2 Ratts Tyerell 79 15 none grey, blue unknown NA male mascu… Aleen … Aleena <chr> <chr> <chr>
#> 3 Wicket Systri Warrick 88 20 brown brown brown 8 male mascu… Endor Ewok <chr> <chr> <chr>
#> 4 Dud Bolt 94 45 none blue, grey yellow NA male mascu… Vulpter Vulpte… <chr> <chr> <chr>
#> # … with 83 more rows, and abbreviated variable names ¹homeworld, ²vehicles, ³starships
Use desc()
to order a column in descending order:
starwars %>% arrange(desc(height))
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehic…¹ stars…²
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
#> 1 Yarael Poof 264 NA none white yellow NA male masculine Quermia Quermian <chr [1]> <chr> <chr>
#> 2 Tarfful 234 136 brown brown blue NA male masculine Kashyyyk Wookiee <chr [1]> <chr> <chr>
#> 3 Lama Su 229 88 none grey black NA male masculine Kamino Kaminoan <chr [1]> <chr> <chr>
#> 4 Chewbacca 228 112 brown unknown blue 200 male masculine Kashyyyk Wookiee <chr [5]> <chr> <chr>
#> # … with 83 more rows, and abbreviated variable names ¹vehicles, ²starships
slice()
slice()
lets you index rows by their (integer)
locations. It allows you to select, remove, and duplicate rows.
This is an important step in Prediction, Modelling and Machine Learning.
We can get characters from row numbers 5 through 10.
starwars %>% slice(5:10)
#> # A tibble: 6 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender homew…¹ species films vehic…² stars…³
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
#> 1 Leia Organa 150 49 brown light brown 19 female femini… Aldera… Human <chr> <chr> <chr>
#> 2 Owen Lars 178 120 brown, grey light blue 52 male mascul… Tatooi… Human <chr> <chr> <chr>
#> 3 Beru Whitesun lars 165 75 brown light blue 47 female femini… Tatooi… Human <chr> <chr> <chr>
#> 4 R5-D4 97 32 <NA> white, red red NA none mascul… Tatooi… Droid <chr> <chr> <chr>
#> # … with 2 more rows, and abbreviated variable names ¹homeworld, ²vehicles, ³starships
It is accompanied by a number of helpers for common use cases:
slice_head()
and slice_tail()
select the
first or last rows.starwars %>% slice_head(n = 3)
#> # A tibble: 3 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehic…¹ stars…²
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
#> 1 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human <chr> <chr> <chr>
#> 2 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr> <chr> <chr>
#> 3 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr> <chr> <chr>
#> # … with abbreviated variable names ¹vehicles, ²starships
slice_sample()
randomly selects rows. Use the option
prop to choose a certain proportion of the cases.starwars %>% slice_sample(n = 5)
#> # A tibble: 5 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehic…¹ stars…²
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
#> 1 Dud Bolt 94 45 none blue, grey yellow NA male masculine Vulpter Vulpter… <chr> <chr> <chr>
#> 2 Bossk 190 113 none green red 53 male masculine Trandosha Trandos… <chr> <chr> <chr>
#> 3 Shaak Ti 178 57 none red, blue, white black NA female feminine Shili Togruta <chr> <chr> <chr>
#> 4 Dormé 165 NA brown light brown NA female feminine Naboo Human <chr> <chr> <chr>
#> # … with 1 more row, and abbreviated variable names ¹vehicles, ²starships
starwars %>% slice_sample(prop = 0.1)
#> # A tibble: 8 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehic…¹ stars…²
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
#> 1 Qui-Gon Jinn 193 89 brown fair blue 92 male masculine <NA> Human <chr> <chr> <chr>
#> 2 Dexter Jettster 198 102 none brown yellow NA male masculine Ojom Besali… <chr> <chr> <chr>
#> 3 R4-P17 96 NA none silver, red red, blue NA none feminine <NA> Droid <chr> <chr> <chr>
#> 4 Lama Su 229 88 none grey black NA male masculine Kamino Kamino… <chr> <chr> <chr>
#> # … with 4 more rows, and abbreviated variable names ¹vehicles, ²starships
Use replace = TRUE
to perform a bootstrap sample. If
needed, you can weight the sample with the weight
argument.
Note : ` Bootstrap samples` are a special statistical sampling method. Counterintuitive perhaps, since you sample **with replacement**. Should remind you of your high school Permutation and Combination class, with all those urn models and so on. If you remember.
slice_min()
and slice_max()
select rows
with highest or lowest values of a variable. Note that we first must
choose only the values which are not NA.starwars %>%
filter(!is.na(height)) %>%
slice_min(height, n = 3)
#> # A tibble: 3 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender homew…¹ species films vehic…² stars…³
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
#> 1 Yoda 66 17 white green brown 896 male mascu… <NA> Yoda's… <chr> <chr> <chr>
#> 2 Ratts Tyerell 79 15 none grey, blue unknown NA male mascu… Aleen … Aleena <chr> <chr> <chr>
#> 3 Wicket Systri Warrick 88 20 brown brown brown 8 male mascu… Endor Ewok <chr> <chr> <chr>
#> # … with abbreviated variable names ¹homeworld, ²vehicles, ³starships
select()
Often you work with large datasets with many columns but only a few
are actually of interest to you. select()
allows you to
rapidly zoom in on a useful subset using operations that usually only
work on numeric variable positions:
# Select columns by name
starwars %>% select(hair_color, skin_color, eye_color)
#> # A tibble: 87 × 3
#> hair_color skin_color eye_color
#> <chr> <chr> <chr>
#> 1 blond fair blue
#> 2 <NA> gold yellow
#> 3 <NA> white, blue red
#> 4 none white yellow
#> # … with 83 more rows
# Select all columns between hair_color and eye_color (inclusive)
starwars %>% select(hair_color:eye_color)
#> # A tibble: 87 × 3
#> hair_color skin_color eye_color
#> <chr> <chr> <chr>
#> 1 blond fair blue
#> 2 <NA> gold yellow
#> 3 <NA> white, blue red
#> 4 none white yellow
#> # … with 83 more rows
# Select all columns except those from hair_color to eye_color (inclusive)
starwars %>% select(!(hair_color:eye_color))
#> # A tibble: 87 × 11
#> name height mass birth_year sex gender homeworld species films vehicles starships
#> <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
#> 1 Luke Skywalker 172 77 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
#> 2 C-3PO 167 75 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
#> 3 R2-D2 96 32 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
#> 4 Darth Vader 202 136 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
#> # … with 83 more rows
# Select all columns ending with color
starwars %>% select(ends_with("color"))
#> # A tibble: 87 × 3
#> hair_color skin_color eye_color
#> <chr> <chr> <chr>
#> 1 blond fair blue
#> 2 <NA> gold yellow
#> 3 <NA> white, blue red
#> 4 none white yellow
#> # … with 83 more rows
There are a number of helper functions you can use within
select()
, like starts_with()
,
ends_with()
, matches()
and
contains()
. These let you quickly match larger blocks of
variables that meet some criterion. See ?select
for more
details.
You can even rename variables with select()
by using
named arguments:
starwars %>% select(home_world = homeworld)
#> # A tibble: 87 × 1
#> home_world
#> <chr>
#> 1 Tatooine
#> 2 Tatooine
#> 3 Naboo
#> 4 Tatooine
#> # … with 83 more rows
But because select()
drops all the variables not
explicitly mentioned, it’s not that useful. Instead, use
rename()
:
starwars %>% rename(home_world = homeworld)
#> # A tibble: 87 × 14
#> name height mass hair_color skin_color eye_color birth_year sex gender home_world species films vehic…¹ stars…²
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
#> 1 Luke Skywalker 172 77 blond fair blue 19 male masculine Tatooine Human <chr> <chr> <chr>
#> 2 C-3PO 167 75 <NA> gold yellow 112 none masculine Tatooine Droid <chr> <chr> <chr>
#> 3 R2-D2 96 32 <NA> white, blue red 33 none masculine Naboo Droid <chr> <chr> <chr>
#> 4 Darth Vader 202 136 none white yellow 41.9 male masculine Tatooine Human <chr> <chr> <chr>
#> # … with 83 more rows, and abbreviated variable names ¹vehicles, ²starships
mutate()
Besides selecting sets of existing columns, it’s often useful to add
new columns that are functions of existing columns. This is the job of
mutate()
:
starwars %>% mutate(height_m = height / 100)
#> # A tibble: 87 × 15
#> name height mass hair_color skin_color eye_color birth_…¹ sex gender homew…² species films vehic…³ stars…⁴ heigh…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list> <dbl>
#> 1 Luke Skywalker 172 77 blond fair blue 19 male mascu… Tatooi… Human <chr> <chr> <chr> 1.72
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi… Droid <chr> <chr> <chr> 1.67
#> 3 R2-D2 96 32 <NA> white, blue red 33 none mascu… Naboo Droid <chr> <chr> <chr> 0.96
#> 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi… Human <chr> <chr> <chr> 2.02
#> # … with 83 more rows, and abbreviated variable names ¹birth_year, ²homeworld, ³vehicles, ⁴starships, ⁵height_m
We can’t see the height in meters we just calculated, but we can fix that using a select command.
starwars %>%
mutate(height_m = height / 100) %>%
select(height_m, height, everything())
#> # A tibble: 87 × 15
#> height_m height name mass hair_color skin_color eye_color birth…¹ sex gender homew…² species films vehic…³ stars…⁴
#> <dbl> <int> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
#> 1 1.72 172 Luke Skywalker 77 blond fair blue 19 male mascu… Tatooi… Human <chr> <chr> <chr>
#> 2 1.67 167 C-3PO 75 <NA> gold yellow 112 none mascu… Tatooi… Droid <chr> <chr> <chr>
#> 3 0.96 96 R2-D2 32 <NA> white, blue red 33 none mascu… Naboo Droid <chr> <chr> <chr>
#> 4 2.02 202 Darth Vader 136 none white yellow 41.9 male mascu… Tatooi… Human <chr> <chr> <chr>
#> # … with 83 more rows, and abbreviated variable names ¹birth_year, ²homeworld, ³vehicles, ⁴starships
dplyr::mutate()
is similar to the base
transform()
, but allows you to refer to columns that you’ve
just created:
starwars %>%
mutate(
height_m = height / 100,
BMI = mass / (height_m^2)
) %>%
select(BMI, everything())
#> # A tibble: 87 × 16
#> BMI name height mass hair_color skin_c…¹ eye_c…² birth…³ sex gender homew…⁴ species films vehic…⁵ stars…⁶ heigh…⁷
#> <dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list> <dbl>
#> 1 26.0 Luke Skywalker 172 77 blond fair blue 19 male mascu… Tatooi… Human <chr> <chr> <chr> 1.72
#> 2 26.9 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi… Droid <chr> <chr> <chr> 1.67
#> 3 34.7 R2-D2 96 32 <NA> white, … red 33 none mascu… Naboo Droid <chr> <chr> <chr> 0.96
#> 4 33.3 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi… Human <chr> <chr> <chr> 2.02
#> # … with 83 more rows, and abbreviated variable names ¹skin_color, ²eye_color, ³birth_year, ⁴homeworld, ⁵vehicles, ⁶starships,
#> # ⁷height_m
If you only want to keep the new variables, use
transmute()
:
starwars %>%
transmute(
height_m = height / 100,
BMI = mass / (height_m^2)
)
#> # A tibble: 87 × 2
#> height_m BMI
#> <dbl> <dbl>
#> 1 1.72 26.0
#> 2 1.67 26.9
#> 3 0.96 34.7
#> 4 2.02 33.3
#> # … with 83 more rows
relocate()
Use a similar syntax as select()
to move blocks of
columns at once
starwars %>% relocate(sex:homeworld, .before = height)
#> # A tibble: 87 × 14
#> name sex gender homeworld height mass hair_color skin_color eye_color birth_year species films vehic…¹ stars…²
#> <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <list> <list> <list>
#> 1 Luke Skywalker male masculine Tatooine 172 77 blond fair blue 19 Human <chr> <chr> <chr>
#> 2 C-3PO none masculine Tatooine 167 75 <NA> gold yellow 112 Droid <chr> <chr> <chr>
#> 3 R2-D2 none masculine Naboo 96 32 <NA> white, blue red 33 Droid <chr> <chr> <chr>
#> 4 Darth Vader male masculine Tatooine 202 136 none white yellow 41.9 Human <chr> <chr> <chr>
#> # … with 83 more rows, and abbreviated variable names ¹vehicles, ²starships
summarise()
The last verb is summarise()
. It collapses a data frame
to a single row.
starwars %>% summarise(mean_height = mean(height, na.rm = TRUE))
#> # A tibble: 1 × 1
#> mean_height
#> <dbl>
#> 1 174.
It’s not that useful until we learn the group_by()
verb
below.
You may have noticed that the syntax and function of all these verbs are very similar:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame.
You can refer to columns in the data frame directly without using
$
.
The result is a new data frame
Together these properties make it easy to chain together multiple simple steps to achieve a complex result.
These five functions provide the basis of a language of data
manipulation. At the most basic level, you can only alter a tidy data
frame in five useful ways: you can reorder the rows
(arrange()
), pick observations and variables of interest
(filter()
and select()
), add new variables
that are functions of existing variables (mutate()
), or
collapse many values to a summary (summarise()
).
%>%
The dplyr API is functional in the sense that function calls don’t
have side-effects. dplyr provides the %>%
operator from
magrittr. x %>% f(y)
turns into f(x, y)
so
you can use it to rewrite multiple operations that you can read
left-to-right, top-to-bottom (reading the pipe operator as “then”):
starwars %>%
group_by(species, sex) %>%
summarise(
mean_height = mean(height, na.rm = TRUE),
mean_mass = mean(mass, na.rm = TRUE)
)
#> `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
#> # A tibble: 41 × 4
#> # Groups: species [38]
#> species sex mean_height mean_mass
#> <chr> <chr> <dbl> <dbl>
#> 1 Aleena male 79 15
#> 2 Besalisk male 198 102
#> 3 Cerean male 198 82
#> 4 Chagrian male 196 NaN
#> # … with 37 more rows