Components of the layered grammar of graphics
Layers are used to create the objects on a plot.
They are defined by five basic parts:
- Data (What dataset/spreadsheet am I using?)
- Mapping (What does each column do in my graph?)
- Statistical transformation (stat) (Do I have to count something first?)
- Geometric object (geom) (What shape, colour, size… do I want?)
- Position adjustment (position) (Where do I want it on the graph?)
Data
We will use “real world” data: the penguins dataset in the palmerpenguins package. Run ?penguins in the console to get more information about this dataset.
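None of the snippets on this page load the packages they rely on. A minimal setup sketch, assuming ggplot2, dplyr (for count() and the %>% pipe used later), and palmerpenguins are already installed:

```r
# Assumes these packages are installed, e.g. via
# install.packages(c("ggplot2", "dplyr", "palmerpenguins"))
library(ggplot2)         # ggplot(), geoms, scales, coords, facets
library(dplyr)           # count() and the %>% pipe used later on
library(palmerpenguins)  # provides the penguins data frame

# Quick structural overview: one row per penguin, one column per variable
str(penguins)
```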
Head
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 female 2007
## 3 Adelie Torgersen 40.3 18 195 3250 female 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 female 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
Tail
tail(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
## <fct> <fct> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Chinstrap Dream 45.7 17 195 3650 female 2009
## 2 Chinstrap Dream 55.8 19.8 207 4000 male 2009
## 3 Chinstrap Dream 43.5 18.1 202 3400 female 2009
## 4 Chinstrap Dream 49.6 18.2 193 3775 male 2009
## 5 Chinstrap Dream 50.8 19 210 4100 male 2009
## 6 Chinstrap Dream 50.2 18.7 198 3775 female 2009
Dim
dim(penguins)
## [1] 344 8
So now we know what our data looks like. We pass this data to ggplot() to plot it, as follows. On its own, this creates an empty graph sheet, because we have not (yet) declared the geometric shapes we want to use to plot our information.
ggplot(data = penguins) # Creates an empty graphsheet, ready for plotting!!
Mapping
Now that we have told R what data to use, we need to state what
variables to plot and how.
Aesthetic Mapping defines how the variables are applied to the plot, i.e. we take a variable from the data and “metaphorize” it into a geometric feature. We can map variables metaphorically to a variety of geometric things: coordinate, length, height, size, shape, colour, alpha (how transparent?)….
The syntax uses:
aes(some_geometric_thing = some_variable)
Remember variable = column.
So if we were graphing information from penguins, we might map a penguin’s flipper_length_mm column to the \(x\) position, and the body_mass_g column to the \(y\) position.
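The mapping just described can be written directly with aes(). The sketch below assumes ggplot2 and palmerpenguins are installed, and uses geom_point() as one possible geom for showing the mapped positions; the name p_mapping is only illustrative:

```r
library(ggplot2)
library(palmerpenguins)

# Map flipper length to the x position and body mass to the y position
p_mapping <- ggplot(data = penguins,
                    mapping = aes(x = flipper_length_mm,
                                  y = body_mass_g)) +
  geom_point()  # a geom is still needed to actually draw something

p_mapping
```

Swapping the two column names inside aes() would flip the axes without touching the data or the geom.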
Mapping Example-1
We can try another example of aesthetic mapping with the same
dataset:
Plot-1a
ggplot(data = penguins)
Plot-1b
ggplot(penguins) +
# Plot geom = histogram. So we need a quantity on the x
geom_histogram(
aes(x = body_mass_g))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
Plot-1c
ggplot(penguins) +
  # Plot geom = histogram. So we need a quantity on the x
  geom_histogram(
    aes(x = body_mass_g,
        fill = island) # fill aesthetic = another variable
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
Geometric objects
Geometric objects (geoms) control the type
of plot you create. Geoms are classified by their dimensionality:
- 0 dimensions - point, text
- 1 dimension - path, line
- 2 dimensions - polygon, interval
Each geom can only display certain aesthetics or
visual attributes of the geom. For example, a point geom has position,
color, shape, and size aesthetics.
We can also stack up geoms on top of one another to add layers to the
graph.
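As a sketch of stacking geoms, the plot below draws points and then adds a fitted straight line on top of them. It assumes ggplot2 and palmerpenguins are installed; the choice of geom_smooth() with a linear model is only one illustrative second layer, and p_layers is a made-up name:

```r
library(ggplot2)
library(palmerpenguins)

# Both layers inherit the data and the x/y mapping from ggplot()
p_layers <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(alpha = 0.5) +                                # layer 1: raw observations
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)  # layer 2: fitted straight line

p_layers
```

Layers are drawn in the order they are added, so the line sits on top of the points.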
Plot1
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_line()
## Warning: Removed 2 row(s) containing missing values (geom_path).
Plot2
ggplot(data = penguins) +
geom_line(aes(x = bill_length_mm,
y = body_mass_g))
## Warning: Removed 2 row(s) containing missing values (geom_path).
Plot3
ggplot(data = penguins) +
geom_point(aes(x = bill_length_mm,
y = body_mass_g,
color = island,
shape = species)) +
ggtitle("A point geom with position and color and shape aesthetics")
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(data = penguins,
aes(x = species)) + # x position => ?
# No need to type "mapping"...
geom_bar() + # Where does the height come from?
ggtitle("A bar geom with position and height aesthetics")
- Position determines the starting location (origin) of each bar
- Height determines how tall to draw the bar. Here the height is based on the number of observations in the dataset for each species.
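The bar heights come from the statistical transformation that geom_bar() applies by default, which counts the rows in each group. A small sketch to check this, assuming ggplot2 and palmerpenguins are installed; layer_data() is a ggplot2 helper that returns the computed data for a layer, and the names p_bar and bar_heights are illustrative:

```r
library(ggplot2)
library(palmerpenguins)

p_bar <- ggplot(penguins, aes(x = species)) +
  geom_bar()

# layer_data() returns the data frame ggplot2 computed for a layer;
# for geom_bar() it includes the `count` column that becomes the bar height
bar_heights <- layer_data(p_bar)[, c("x", "count")]
bar_heights

# The same numbers, computed by hand
table(penguins$species)
```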
Position adjustment
Sometimes with dense data we need to adjust the position of elements
on the plot, otherwise data points might obscure one another. Bar plots
frequently stack or dodge the bars to
avoid overlap:
count(x = penguins, species, island) %>%
ggplot(mapping = aes(x = species, y = n, fill = island)) +
geom_bar(stat = "identity") +
ggtitle(label = "A stacked bar chart")
count(x = penguins, species, island) %>%
ggplot(mapping = aes(x = species, y = n, fill = island)) +
geom_bar(stat = "identity", position = "dodge") +
ggtitle(label = "A dodged bar chart")
penguins %>%
ggplot(mapping = aes(x = species, fill = island)) +
geom_bar() +
ggtitle(label = "A stacked bar chart")
penguins %>%
ggplot(mapping = aes(x = species, fill = island)) +
geom_bar(position = "dodge") +
ggtitle(label = "A dodged bar chart")
Sometimes scatterplots with few unique \(x\) and \(y\) values are jittered
(random noise is added) to reduce overplotting.
ggplot(data = penguins,
mapping = aes(x = species,
y = body_mass_g)) +
geom_point() +
ggtitle("A point geom with obscured data points")
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(data = penguins,
mapping = aes(x = species,
y = body_mass_g)) +
geom_jitter() +
ggtitle("A point geom with jittered data points")
## Warning: Removed 2 rows containing missing values (geom_point).
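geom_jitter() is a shortcut for a point geom with a jitter position adjustment. Writing the position out gives control over how much noise is added. The sketch below assumes ggplot2 and palmerpenguins are installed; the width and seed values are arbitrary choices for illustration:

```r
library(ggplot2)
library(palmerpenguins)

p_jitter <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_point(
    position = position_jitter(width = 0.2,  # horizontal noise only...
                               height = 0,   # ...so body mass values stay exact
                               seed = 1)     # arbitrary seed, makes the noise repeatable
  )

p_jitter
```

Setting height = 0 matters here because the \(y\) axis carries real measurements; only the categorical \(x\) axis can absorb noise without distorting the data.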
Scale
A scale controls how data is mapped to aesthetic
attributes, so we need one scale for every aesthetic property employed
in a layer. For example, this graph defines a scale for color:
ggplot(data = penguins,
mapping = aes(x = bill_depth_mm,
y = bill_length_mm,
color = species)) +
geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
The scale can be changed to use a different color palette:
ggplot(data = penguins,
mapping = aes(x = bill_length_mm,
y = body_mass_g,
color = species)) +
geom_point() +
scale_color_brewer(palette = "Dark2",direction = -1)
## Warning: Removed 2 rows containing missing values (geom_point).
Now we are using a different palette, but the scale is still consistent: all Adelie penguins use the same color, and Chinstrap penguins use a different color, but each Chinstrap still uses the same, consistent color.
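A scale can also be spelled out value by value. The sketch below uses scale_color_manual() with a named vector so that each species is tied to one color. It assumes ggplot2 and palmerpenguins are installed, and the three colors are illustrative choices, not ones taken from this tutorial:

```r
library(ggplot2)
library(palmerpenguins)

# Illustrative colour choices; any valid R colour names or hex codes would do
species_colours <- c(Adelie = "darkorange",
                     Chinstrap = "purple",
                     Gentoo = "cyan4")

p_manual <- ggplot(penguins,
                   aes(x = bill_length_mm, y = body_mass_g, color = species)) +
  geom_point() +
  scale_color_manual(values = species_colours)

p_manual
```

Because the vector is named, the species-to-color pairing does not depend on the order of the factor levels.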
Coordinate system
A coordinate system (coord) maps the
position of objects onto the plane of the plot, and controls how the
axes and grid lines are drawn. Plots typically use two coordinates
(\(x, y\)), but could use any number of
coordinates. Most plots are drawn using the Cartesian
coordinate system:
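x1 <- c(1, 10)
y1 <- c(1, 5)
# qplot() is deprecated in recent ggplot2 releases, so build the plot with ggplot()
p <- ggplot(data.frame(x1, y1), aes(x = x1, y = y1)) +
  geom_point() +
  labs(x = NULL, y = NULL) +
  theme_bw()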
p +
ggtitle(label = "Cartesian coordinate system")
ggplot(penguins, aes(flipper_length_mm, body_mass_g)) +
geom_point() +
coord_polar()
## Warning: Removed 2 rows containing missing values (geom_point).
The Cartesian system requires a fixed and equal spacing between values on the axes. That is, the graph draws the same distance between 1 and 2 as it does between 5 and 6. The plot p could instead be drawn using a semi-log coordinate system, which logarithmically compresses the distance on an axis:
p +
coord_trans(y = "log10") +
ggtitle(label = "Semi-log coordinate system")
Or could even be drawn using polar
coordinates:
p +
coord_polar() +
ggtitle(label = "Polar coordinate system")
Faceting
Faceting can be used to split the data up into
subsets of the entire dataset. This is a powerful tool when
investigating whether patterns are the same or different across
conditions, and allows the subsets to be visualized on the same plot
(known as conditioned or trellis
plots). The faceting specification describes which variables should be
used to split up the data, and how they should be arranged.
ggplot(data = penguins,
mapping = aes(x = bill_length_mm,
y = body_mass_g)) +
geom_point() +
facet_wrap(~ island)
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(data = penguins, mapping = aes(x = bill_length_mm, y = body_mass_g, color = sex)) +
geom_point() +
facet_grid(species ~ island, scales = "free_y")
## Warning: Removed 2 rows containing missing values (geom_point).
# Ria's explanation: This code did not work because....
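facet_grid() also accepts its row and column variables through vars(), which some readers find easier to follow than the formula form. A sketch with the same layout as species ~ island, assuming ggplot2 (3.0.0 or later) and palmerpenguins are installed; p_facets is an illustrative name:

```r
library(ggplot2)
library(palmerpenguins)

p_facets <- ggplot(penguins, aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point() +
  facet_grid(rows = vars(species),  # one row of panels per species
             cols = vars(island),   # one column of panels per island
             scales = "free_y")     # each row of panels gets its own y range

p_facets
```

Panels for species and island combinations that have no observations are still drawn, just empty.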
Defaults
Rather than explicitly declaring each component of a layered graphic (which uses more code and introduces opportunities for errors), we can establish intelligent defaults for specific geoms and scales. For instance, whenever we want to use a bar geom, we can default to using a stat that counts the number of observations in each group of our variable in the \(x\) position.
Consider the following scenario: you wish to generate a scatterplot visualizing the relationship between penguins’ bill length and their body mass. With no defaults, the code to generate this graph is:
ggplot() +
layer(
data = penguins,
mapping = aes(x = bill_length_mm,
y = body_mass_g),
geom = "point",
stat = "identity",
position = "identity"
) +
scale_x_continuous() +
scale_y_continuous() +
coord_cartesian()
## Warning: Removed 2 rows containing missing values (geom_point).
The above code:
- Creates a new plot object (ggplot)
- Adds a layer (layer)
  - Specifies the data (penguins)
  - Maps bill length to the \(x\) position and body mass to the \(y\) position (mapping)
  - Uses the point geometric object (geom = "point")
  - Implements an identity transformation and position (stat = "identity" and position = "identity")
- Establishes two continuous position scales (scale_x_continuous and scale_y_continuous)
- Declares a Cartesian coordinate system (coord_cartesian)
How can we simplify this using intelligent defaults?
- We only need to specify one of geom and stat, since each geom has a default stat.
- Cartesian coordinate systems are the most commonly used, so they should be the default.
- Default scales can be added based on the aesthetic and type of variables.
  - Continuous values are transformed with a linear scaling.
  - Discrete values are mapped to integers.
  - Scales for aesthetics such as color, fill, and size can also be intelligently defaulted.
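These defaults can be inspected on the layer objects that the geom functions return. The sketch below assumes ggplot2 is installed and peeks at the geom, stat, and position fields of a layer, which is an internal detail rather than a documented interface; default_parts() is a hypothetical helper written only for this illustration:

```r
library(ggplot2)

# Hypothetical helper (not part of ggplot2): report the class names of the
# geom, stat, and position that a layer object carries
default_parts <- function(layer) {
  c(geom     = class(layer$geom)[1],
    stat     = class(layer$stat)[1],
    position = class(layer$position)[1])
}

default_parts(geom_point())      # identity stat, identity position
default_parts(geom_bar())        # counting stat, stacked position
default_parts(geom_histogram())  # binning stat, stacked position
```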
Using these defaults, we can rewrite the above code as:
ggplot() +
geom_point(data = penguins,
mapping = aes(x = bill_length_mm,
y = body_mass_g))
## Warning: Removed 2 rows containing missing values (geom_point).
This generates the exact same plot, but uses fewer lines of code. Because multiple layers can use the same components (data, mapping, etc.), we can also specify that information in the ggplot() function rather than in the layer() function:
ggplot(data = penguins,
mapping = aes(x = bill_length_mm,
y = body_mass_g)) +
geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
And as we will learn, function arguments in R can be matched by position, so we can omit the argument names data and mapping:
ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
