Abstract
Part of my R for Artists and Designers workshop course.
This RMarkdown document is part of a Workshop on Data Visualization with R. The material is based on A Layered Grammar of Graphics by Hadley Wickham. The intent is to build skill in coding in R, and also to appreciate R as a way to metaphorically visualize information of various kinds, using predominantly geometric figures and structures.
All RMarkdown files combine code, text, web-images, and figures developed using code. Everything is text; code chunks are enclosed in fences (```)
The method followed will be based on PRIMM (Predict, Run, Investigate, Modify, Make):
- Investigate what the parameters of the code do and write comments to explain. What bells and whistles can you see?
- Modify the parameters of the code provided to understand the options available. Write comments to show what you have aimed for and achieved.
In the following:
When it is YOUR TURN: wherever you see YOUR TURN, please respond with explanations, more questions and if you are already confident, code chunks to create new calculations and graphs.
The setup code chunk below brings into our coding session R packages that provide specific computational abilities and also datasets which we can use.
To reiterate: Packages and datasets are not the same thing!! Packages are (small) collections of programs. Datasets are just… information.
knitr::opts_chunk$set(echo = TRUE,warning = TRUE)
library(tidyverse)
library(palmerpenguins)
library(janitor)
In this RMarkdown document, we try to connect story-making questions with two ideas:
So: a question identifies a variable and a question also leads to a Computation or a Data Visualization. The idea is to get the intuition behind data, and iteratively ask the questions and form hypotheses and perform Exploratory Data Analysis (EDA) using graphs and charts in R.
At some point we may find that the data is not adequate to prove/disprove a particular hypothesis and need to get into further research / experimental design. It is possible to design the research experiments also in R, but we may cover that much later.
So how do we ask questions? These are usually with interrogative pronouns in English: What? Who? Where? Which? What Kind? How? and so on.
penguins dataset
names(penguins) # Column, i.e. Variable names
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
head(penguins) # first six rows
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
tail(penguins) # Last six rows
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female | 2009 |
Chinstrap | Dream | 55.8 | 19.8 | 207 | 4000 | male | 2009 |
Chinstrap | Dream | 43.5 | 18.1 | 202 | 3400 | female | 2009 |
Chinstrap | Dream | 49.6 | 18.2 | 193 | 3775 | male | 2009 |
Chinstrap | Dream | 50.8 | 19.0 | 210 | 4100 | male | 2009 |
Chinstrap | Dream | 50.2 | 18.7 | 198 | 3775 | female | 2009 |
dim(penguins) # Size of dataset
## [1] 344 8
# Check for missing data
any(is.na(penguins))
## [1] TRUE
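Knowing that something is missing is only a first step. A minimal sketch of how one might locate the gaps, assuming only base R and the `palmerpenguins` package (the names `na_per_column` and `rows_with_na` are made up for this illustration):

```r
library(palmerpenguins)

# Number of missing values in each column (variable)
na_per_column <- colSums(is.na(penguins))
na_per_column

# Number of rows that contain at least one missing value
rows_with_na <- sum(!complete.cases(penguins))
rows_with_na
```

Columns with a count of zero are complete; the others tell us which variables to handle with care.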
names()?
State a few questions after discussion with your friend and state possible variables, or what you could DO with the variables, as an answer.
E.g. Q. How many penguins? A. We need to count…rows?
In the Table below, we have a rough mapping of interrogative pronouns to the kinds of variables in the data:
Pronoun | Answer | Variable/Scale | Example | What Operations? |
---|---|---|---|---|
How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios / Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation |
How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities with Scale. Differences are meaningful, but not products or ratios. | Quantitative/Interval | pH, SAT score (200-800), Credit score (300-850), Year of Starting College | Mean, Standard Deviation |
How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some "order" (e.g. good, better, improved, best...) | Qualitative/Ordinal | Socioeconomic status (Low income, Middle income, High income), Education level (HighSchool, BS, MS, PhD), Satisfaction rating (Very much Dislike, Dislike, Neutral, Like, Very Much Like) | Median, Percentile |
What, Who, Where, Whom, Which | Name, Place, Animal, Thing | Qualitative/Nominal | Name | Count no. of cases, Mode |
As you go from Qualitative to Quantitative data types in the table (that is, reading it from the bottom row upwards), I hope you can detect a movement from fuzzy groups/categories to more and more crystallized numbers. Each variable/scale can also be subjected to the operations of the groups before it in that sequence, i.e. the rows below it in the table.
In the words of S. S. Stevens:
the basic operations needed to create each type of scale is cumulative: to an operation listed opposite a particular scale must be added all those operations preceding it.
See his classic paper (PDF). Do think about this as you work with data.
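The four scales can be tried out directly in R. The following is a small sketch with made-up values (none of these vectors come from the workshop datasets), showing which operations make sense at each level:

```r
# Nominal: categories with no inherent order
species <- factor(c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"))
table(species)                                   # count the cases in each category
most_common <- names(which.max(table(species)))  # the most frequent category

# Ordinal: categories with a meaningful order
education <- factor(
  c("BS", "PhD", "HighSchool", "MS", "BS"),
  levels = c("HighSchool", "BS", "MS", "PhD"),
  ordered = TRUE
)
at_least_ms <- education >= "MS"  # comparisons work because the levels are ordered
highest <- max(education)

# Interval: differences are meaningful, ratios are not
start_year <- c(2019, 2021, 2022)
diff(range(start_year))           # a span of 3 years is meaningful

# Ratio: a true zero, so ratios are meaningful too
body_mass_g <- c(3750, 3800, 3250)
body_mass_g / min(body_mass_g)    # "how many times heavier than the lightest?"
```

Note how each step keeps what the earlier one allowed (counting still works on numbers) and adds something new.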
Do take a look at these references:
mpg dataset
names(mpg) # Column, i.e. Variable names
## [1] "manufacturer" "model" "displ" "year" "cyl"
## [6] "trans" "drv" "cty" "hwy" "fl"
## [11] "class"
head(mpg) # first six rows
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
tail(mpg) # Last six rows
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
---|---|---|---|---|---|---|---|---|---|---|
volkswagen | passat | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | midsize |
volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |
volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize |
dim(mpg) # Size of dataset
## [1] 234 11
# Check for missing data
any(is.na(mpg))
## [1] FALSE
Look carefully at the variables here. How would you interpret, say, the cyl variable? Is it a number and therefore Quantitative, or could it be something else?
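One way to probe this question (a sketch, not the answer) is to look at how many distinct values `cyl` takes and to try treating it as a set of labelled groups. The names `cyl_values` and `cyl_as_category` are invented here for illustration:

```r
library(ggplot2)  # provides the mpg dataset

# How many distinct values does cyl take?
cyl_values <- sort(unique(mpg$cyl))
cyl_values

# Treating cyl as a set of labelled groups rather than a measurement
cyl_as_category <- factor(mpg$cyl)
table(cyl_as_category)
```

Whether the number or the category reading is more useful depends on the question being asked of the data.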
A first task is often the reading in of external data. Data is best stored and shared in a CSV file format. Download this CSV file into your project folder:
To read this CSV data into our R Session, we use a command called read_csv from the readr package:
mpg_new <- readr::read_csv("mpg_uppercase.csv")
## Rows: 234 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): MANUFACTURER, MODEL, TRANS, DRV, FL, CLASS
## dbl (5): DISPL, YEAR, CYL, CTY, HWY
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
As can be seen, read_csv tells us by default what the column names are and what types they are: chr and dbl in this case.
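The message printed by read_csv suggests that the column types can also be stated explicitly. A minimal sketch of that idea, using a tiny stand-in CSV written to a temporary file (the three rows are hypothetical and exist only so the example runs without `mpg_uppercase.csv`):

```r
library(readr)

# A tiny stand-in CSV written to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
write_lines(c("MODEL,CYL,HWY", "a4,4,29", "passat,6,26"), tmp)

small <- read_csv(
  tmp,
  col_types = cols(
    MODEL = col_character(),
    CYL   = col_integer(),   # whole numbers instead of the default dbl
    HWY   = col_double()
  )
)

spec(small)   # the full column specification, as the message suggests
```

Because the types are given up front, read_csv no longer needs to guess them and does not print the column specification message.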
In the event that the column names are not very good or evocative, we can set name_repair = make_clean_names inside read_csv; this additional function is available via the janitor package.
read_csv("mpg_uppercase.csv",
show_col_types = TRUE,
name_repair = make_clean_names) %>% # needs `janitor`
glimpse()
## Rows: 234 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): manufacturer, model, trans, drv, fl, class
## dbl (5): displ, year, cyl, cty, hwy
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <dbl> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <dbl> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <dbl> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <dbl> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Note that the default naming is based on snake_case. There are other ways and if we desire to use them, we need to wrap make_clean_names in what is called a lambda function with extra parameters:
read_csv("mpg_uppercase.csv",
show_col_types = TRUE,
name_repair =
# Here is the lambda function
# To be used only when you want to go beyond the defaults
# using additional parameters. e.g `case`
~ make_clean_names(.,
case = "big_camel")) %>%
# needs `janitor`
# `.` denotes the VECTOR of column names
glimpse()
## Rows: 234 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Manufacturer, Model, Trans, Drv, Fl, Class
## dbl (5): Displ, Year, Cyl, Cty, Hwy
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 234
## Columns: 11
## $ Manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ Model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ Displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ Year <dbl> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ Cyl <dbl> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ Trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ Drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ Cty <dbl> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ Hwy <dbl> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ Fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ Class <chr> "compact", "compact", "compact", "compact", "compact", "c…
Note the usage of ~ to make a function: this is needed only if we want to pass (additional) arguments to make_clean_names. (We will find this also when we encounter the purrr package, where we run functions iteratively over each individual entry in a vector/column or list-column.)
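To see what the `~` shorthand stands for, it may help to write the same thing as an ordinary anonymous function and try it on a plain vector of names. This is a sketch; `col_names` and `to_big_camel` are names made up for the illustration:

```r
library(janitor)

col_names <- c("MANUFACTURER", "MODEL", "DISPL")

# Default behaviour: snake_case
make_clean_names(col_names)

# An ordinary function doing what the `~` shorthand does:
# `x` plays the role of `.`, the vector of column names
to_big_camel <- function(x) make_clean_names(x, case = "big_camel")
to_big_camel(col_names)
```

A function such as `to_big_camel` could be handed to `name_repair` in place of the `~` version, since `name_repair` accepts a function that takes the vector of names.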
Now that we know how to access “R internal” datasets and to read in external datasets, we can also respond to (more complex) Questions, with not just a variable but one of two things:
What sort of calculations and visual charts can we create with different kinds of variables, taken singly or together? Let us write some simple English descriptions of measures and visuals and see what commands they use in R.
Here we will use the Grammar of a package called ggplot2, which we will encounter in Lab:04. Let us go with our intuition with the code in the following sections.
Note: since we saw a couple of missing entries in the penguins dataset, let us remove them for now.
penguins <- penguins %>% drop_na()
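It is worth checking how much data such a step throws away. A small sketch, using a separate name (`penguins_complete`, made up here) so that the original dataset stays untouched:

```r
library(palmerpenguins)
library(tidyr)

rows_before <- nrow(palmerpenguins::penguins)
penguins_complete <- drop_na(palmerpenguins::penguins)
rows_after <- nrow(penguins_complete)

rows_before - rows_after   # rows removed because they held at least one NA
```

drop_na() removes a whole row even if only one of its values is missing, so the number of rows lost can be larger than expected.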
levels / Counts for each level
- count / tally of no. of penguins on each island or in each species
- sort and order by island or species
- geom_bar / geom_bar + coord_polar() / Find out!!
penguins %>% count(species)
species | n |
---|---|
Adelie | 146 |
Chinstrap | 68 |
Gentoo | 119 |
ggplot(penguins) + geom_bar(aes(x = island))
ggplot(penguins) + geom_bar(aes(x = sex))
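The "sort and order" idea mentioned above can be sketched as follows. This is one possible approach, assuming the tidyverse (which includes dplyr and forcats) is installed; `island_counts` and `sorted_bars` are illustrative names:

```r
library(tidyverse)
library(palmerpenguins)

penguins_complete <- drop_na(penguins)

# Tally per island, largest group first
island_counts <- penguins_complete %>% count(island, sort = TRUE)
island_counts

# The same ordering applied to the bars: fct_infreq() (from forcats)
# reorders a categorical variable by how often each level occurs
sorted_bars <- ggplot(penguins_complete) +
  geom_bar(aes(x = fct_infreq(island)))
sorted_bars
```

Sorting the bars by size often makes the "How many of each Kind?" comparison easier to read than the default alphabetical order.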
Use the mpg_new dataset to create a few Single Categorical Graphs.
Questions: How many? How few? How often? How much?
Calculations: max / min / mean / mode / (units)
max(), min(), range(), mean(), summary() (note that base R's mode() reports how an object is stored, not its most frequent value)
geom_histogram() / geom_density()
max(penguins$bill_length_mm)
## [1] 59.6
range(penguins$bill_length_mm, na.rm = TRUE)
## [1] 32.1 59.6
summary(penguins$flipper_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 172 190 197 201 213 231
ggplot(penguins) + geom_density(aes(bill_length_mm))
ggplot(penguins) + geom_histogram(aes(x = bill_length_mm))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
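Two follow-ups to the calculations and charts above, as a hedged sketch: a small helper for the most frequent value (`most_frequent` is a hypothetical name, not a base R function), and a histogram with the bin width chosen by hand, as the message suggests:

```r
library(tidyverse)
library(palmerpenguins)

bill_lengths <- drop_na(penguins)$bill_length_mm

# Most frequent value: tabulate the values, then pick the largest count.
# If several values tie, the smallest of them is returned.
most_frequent <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
most_frequent(c(1, 2, 2, 3))
most_frequent(bill_lengths)

# Histogram with an explicit bin width of 2 mm instead of the default 30 bins
ggplot(drop_na(penguins)) +
  geom_histogram(aes(x = bill_length_mm), binwidth = 2)
```

Trying a few different values of `binwidth` is a quick way to see how much the shape of a histogram depends on that choice.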
Are all the above Quantitative variables ratio variables? Justify. Use the mpg_new dataset to create a few Single Quant Graphs.
We can easily extend our intuition about one quantitative variable, to a pair of them. What Questions can we ask?
Questions: How many of this vs How many of that? Does this depend upon that? How are they related? (Remember \(y = mx + c\) and friends?)
Calculations: Correlation / Covariance / T-test for Two Means / Chi-Square Test etc. We won’t go into this here!
Charts: Scatter Plot / Line Plot / Regression i.e. best fit lines
cor(penguins$bill_length_mm, penguins$bill_depth_mm)
## [1] -0.2286256
ggplot(penguins) +
geom_point(aes(x = flipper_length_mm,
y = body_mass_g))
ggplot(penguins) +
geom_point(aes(x = flipper_length_mm,
y = bill_length_mm))
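The "best fit line" mentioned among the charts connects directly to \(y = mx + c\). A minimal sketch, assuming a straight line is a reasonable description of the relationship (`fit` and `penguins_complete` are illustrative names):

```r
library(tidyverse)
library(palmerpenguins)

penguins_complete <- drop_na(penguins)

# Fit body_mass_g = m * flipper_length_mm + c
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins_complete)
coef(fit)   # "(Intercept)" is c, "flipper_length_mm" is m

# Scatter plot with the fitted straight line drawn on top
ggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```

The slope m reads as "grams of body mass per extra millimetre of flipper", which is one way of turning the chart back into a sentence.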
Use the mpg_new dataset to create a few Quant vs Quant Graphs.
What sort of question could we ask that involves two categorical variables?
Questions: How Many of this Kind (~x) are How Many of that Kind (~y)?
Calculations: Counts and Tallies sliced by Category
count, tally
Charts: Stacked Bar Charts / Grouped Bar Charts / Segmented Bar Chart / Mosaic Chart
geom_bar()
- fill, color
- position of the bars
ggplot(penguins) + geom_bar(aes(x = island,
fill = species),
position = "stack")
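The tally behind such a chart, and an alternative position for the bars, might look like this. It is a sketch; `island_species` and `penguins_complete` are names invented for the example:

```r
library(tidyverse)
library(palmerpenguins)

penguins_complete <- drop_na(penguins)

# Counts sliced by two categorical variables at once
island_species <- penguins_complete %>% count(island, species)
island_species

# Same data, bars placed side by side instead of stacked
ggplot(penguins_complete) +
  geom_bar(aes(x = island, fill = species),
           position = "dodge")
```

Combinations that never occur (a species absent from an island) simply do not appear as rows in the tally.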
Storyline: Three penguins. And you, too, are three. (Oh never mind!)
Use the mpg_new dataset to create a few Categorical vs Categorical Graphs.
Finally, what if we want to look at Quant variables and Qual variables together? What questions could we ask?
Questions: How much of this is Which Kind of that? How many vs Which? How many vs How?
Calculations: Counts, Means, Ranges etc., grouped by Categorical variable.
geom_bar / geom_density / geom_violin / geom_boxplot using Categorical Variable for grouping
ggplot(penguins) +
geom_density(aes(x = body_mass_g,
color = island,
fill = island),
alpha = 0.3)
ggplot(penguins) +
geom_histogram(aes(x = flipper_length_mm,
fill = sex))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
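The grouped calculations listed above (Counts, Means, Ranges, grouped by a Categorical variable) can be sketched with group_by() and summarise(), followed by one of the grouped charts. `mass_by_species` and its column names are chosen here for illustration:

```r
library(tidyverse)
library(palmerpenguins)

penguins_complete <- drop_na(penguins)

# Count, mean and range of body mass, computed separately for each species
mass_by_species <- penguins_complete %>%
  group_by(species) %>%
  summarise(
    n           = n(),
    mean_mass_g = mean(body_mass_g),
    min_mass_g  = min(body_mass_g),
    max_mass_g  = max(body_mass_g)
  )
mass_by_species

# One box per species: the categorical variable does the grouping
ggplot(penguins_complete) +
  geom_boxplot(aes(x = species, y = body_mass_g))
```

The table and the boxplot answer the same question, "How much of this is Which Kind of that?", once in numbers and once in a picture.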
Use the mpg_new dataset to create a few Quant vs Categorical Graphs.
- Any data set in your R installation. Type data() in your console to see what is available.
- diamonds. This dataset is part of ggplot2, which is loaded along with the tidyverse package, so just type diamonds in your code and there it is.
- gapminder!! Yes!! You will need to install the gapminder package to access this dataset.
- mosaicData package datasets. Install mosaicData first!!
- data.world: Find Datasets of your choice: https://docs.data.world/en/64499-64516-Quickstarts-and-tutorials.html
- kaggle: https://www.kaggle.com/datasets
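To browse the datasets that come with one particular package, data() can be given a package name. The sketch below also checks whether an optional package is installed before using its data; gapminder is used only as an example:

```r
# Names of the datasets shipped with one particular package
ggplot2_data <- data(package = "ggplot2")$results[, "Item"]
ggplot2_data

# Check whether a package is installed before trying to use its data
if (requireNamespace("gapminder", quietly = TRUE)) {
  head(gapminder::gapminder)
} else {
  message("Run install.packages(\"gapminder\") first")
}
```

Once a dataset name is known, `?diamonds` (or the equivalent for another dataset) opens its help page with a description of every variable.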
Ask me for help any time!