Workshop 3 Data Wrangling and Data Summarising
In this workshop we shall take our first look at some key tools in the Tidyverse that will allow us to wrangle and tidy our data so that it’s in the format that we need in order to visualize and model it. By making our data wrangling reproducible (i.e., by coding it in R), we can easily re-run this stage of our analysis pipeline as new data gets added. Reproducibility of the data wrangling stage is a key part of the analysis process and often gets overlooked in terms of needing to ensure it is reproducible.
The Tidyverse is a collection of packages that all ‘play nicely’ with each other. They are based on a common philosophy where data are represented in rectangular format (i.e., with rows and columns). These rectangular structures are known in the Tidyverse as tibbles
. If you’re interested, you can read more about tibbles
in the R4DS book here.
3.1 Wrangling your data
Have a look at the following video where I walk you through this worksheet. Then I want you to work through the content by writing (and running) the script on your own machine.
3.1.1 Loading the Tidyverse
Let’s take our first look at data wrangling. We are going to start with a dataset that comes with the Tidyverse. The dataset is called mpg
and comprises fuel economy data from 1999 to 2008 for 38 popular models of cars in the US.
First, we need to load the tidyverse
library with the following:
If you run this line without having first installed the Tidyverse on your computer, you will encounter an error. R packages only need to be installed once, so if you want to load one into your library for the first time, you need to install it with install.packages(*packagename*)
.
For the tidyverse
we need to install it with:
Once you have installed the tidyverse
, you can then load it into your llbrary with the library()
function. You only ever need to install a package once on your machine (unless you have updated R or you want to install the most up-to-date version of a particular package). When you are writing your R scripts, you never want to have the install.packages()
function in the body of the script as if someone else were to run your script, this would update packages on their computer (which they might not want).
The mpg
dataset
The mpg
dataset is loaded as part of the Tidyverse. In the help file, which you can access by typing help(mpg)
or ?mpg
we see the following:
Description This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.
A data frame with 234 rows and 11 variables.
manufacturer - manufacturer model - model name displ - engine displacement, in litres year - year of manufacture cyl - number of cylinders trans -type of transmission drv -f = front-wheel drive, r = rear wheel drive, 4 = 4wd cty - city miles per gallon hwy - highway miles per gallon fl - fuel type class - “type” of car
Note you can also use help to inspect functions, for example: Typing ?sd
will show you the documentation for R’s Standard Deviation function.
3.1.2 Using head()
and str()
We can explore the mpg
dataset that is loaded with the Tidyverse in a number of ways. If we want to look at the first 6 lines of the dataset, we can use the head()
function.
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
We see that it is a tibble - or a rectangular data frame - made up of rows and columns. This is tidy
format where each observation corresponds to a row. Most of the analyses we will run in R involve tidy
data. Within the Tidyverse, the tibble
is the standard way to represent data. You’ll spend a lot of time tidying and wrangling your data to get it into this format! By doing this in R using a script that you write, you are making this key stage reproducible. You can run the script again on an updated or different dataset - thus likely saving you lots of time!
We can also ask for information about the structure of our dataset with str()
. This will tell us about the columns, what type of variable each is, the number of rows etc.
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
3.1.3 Use select()
to select columns
If we want to, we could just select one of the columns using the select()
function. Below we are just selecing the column entitled manufacturer
.
## # A tibble: 234 × 1
## manufacturer
## <chr>
## 1 audi
## 2 audi
## 3 audi
## 4 audi
## 5 audi
## 6 audi
## 7 audi
## 8 audi
## 9 audi
## 10 audi
## # ℹ 224 more rows
Related to the select()
function is rename()
. It does exactly what you think it might; it renames a column.
We can also look at the different car manufacturers in the dataset by using the distinct()
function. This gives us the unique manufacturer names. This function can be quite handy if you want to check a dataset for duplicates of (e.g.) participant IDs.
## # A tibble: 15 × 1
## manufacturer
## <chr>
## 1 audi
## 2 chevrolet
## 3 dodge
## 4 ford
## 5 honda
## 6 hyundai
## 7 jeep
## 8 land rover
## 9 lincoln
## 10 mercury
## 11 nissan
## 12 pontiac
## 13 subaru
## 14 toyota
## 15 volkswagen
3.1.4 Use filter()
to select rows
Sometimes we might want to select only a subset of rows in our dataset. We can do that using the filter()
function. For example, here we filter our dataset to include only cars made by ‘honda’.
## # A tibble: 9 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.6 1999 4 manual(m5) f 28 33 r subco…
## 2 honda civic 1.6 1999 4 auto(l4) f 24 32 r subco…
## 3 honda civic 1.6 1999 4 manual(m5) f 25 32 r subco…
## 4 honda civic 1.6 1999 4 manual(m5) f 23 29 p subco…
## 5 honda civic 1.6 1999 4 auto(l4) f 24 32 r subco…
## 6 honda civic 1.8 2008 4 manual(m5) f 26 34 r subco…
## 7 honda civic 1.8 2008 4 auto(l5) f 25 36 r subco…
## 8 honda civic 1.8 2008 4 auto(l5) f 24 36 c subco…
## 9 honda civic 2 2008 4 manual(m6) f 21 29 p subco…
Note, we use the operator ==
which means ‘is equal to’. This is a logical operator - other logical operators include less than <
, greater than >
, less than or equal to <=
, greater then or equal to >=
, and is not equal to !=
.
We can also filter using a combination of possibilities via logical OR |
or logical AND &
. The first code chunk below filters the dataset for cases where the manufacturer is ‘honda’ OR ‘toyota’.
## # A tibble: 43 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.6 1999 4 manu… f 28 33 r subc…
## 2 honda civic 1.6 1999 4 auto… f 24 32 r subc…
## 3 honda civic 1.6 1999 4 manu… f 25 32 r subc…
## 4 honda civic 1.6 1999 4 manu… f 23 29 p subc…
## 5 honda civic 1.6 1999 4 auto… f 24 32 r subc…
## 6 honda civic 1.8 2008 4 manu… f 26 34 r subc…
## 7 honda civic 1.8 2008 4 auto… f 25 36 r subc…
## 8 honda civic 1.8 2008 4 auto… f 24 36 c subc…
## 9 honda civic 2 2008 4 manu… f 21 29 p subc…
## 10 toyota 4runner 4… 2.7 1999 4 manu… 4 15 20 r suv
## # ℹ 33 more rows
While below we filter for cases where the manufacturer is ‘honda’ and the year of manufacture is ‘1999’.
## # A tibble: 5 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 honda civic 1.6 1999 4 manual(m5) f 28 33 r subco…
## 2 honda civic 1.6 1999 4 auto(l4) f 24 32 r subco…
## 3 honda civic 1.6 1999 4 manual(m5) f 25 32 r subco…
## 4 honda civic 1.6 1999 4 manual(m5) f 23 29 p subco…
## 5 honda civic 1.6 1999 4 auto(l4) f 24 32 r subco…
Combining functions
We can combine the use of filter()
with select()
to filter for case where the manufacturer is ‘honda’, the year of manufacture is ‘1999’ and we only want to display these two columns plus those telling us about fuel economy - cty
and hwy
.
## # A tibble: 5 × 4
## manufacturer year cty hwy
## <chr> <int> <int> <int>
## 1 honda 1999 28 33
## 2 honda 1999 24 32
## 3 honda 1999 25 32
## 4 honda 1999 23 29
## 5 honda 1999 24 32
By combining just a few functions, you can imagine that we can build some quite complex data wrangling rules quite straightforwardly.
3.1.5 The pipe %>%
Note that in these examples above we used the %>%
operator - this is called the pipe and allows us to pass information from one side of the pipe to the other. You can read it out load as ‘and then’. All of the functions (such as select()
, filter()
etc.) in the Tidyverse are known as verbs, and they describe what they do. The pipe is one of the most commonly used operators in the Tidyverse and allows us to chain together different lines of code - with the output of each line being passed on as input into the next. In this example, the dataset mpg
is passed along to the distinct()
function where we ask for a list of the distinct (i.e., unique) manufacturers. This output itself is a vector. Vectors are a basic data structure and contain elements of the same type - for example, a bunch of numbers. We can add another line to our piped chain to tell us how many elements are in this vector. We could read this out loud as ‘take the dataset mpg, and then work out the distinct manufacturer names, and then count them’.
## # A tibble: 1 × 1
## n
## <int>
## 1 15
3.1.6 Tidying up a dataset
Tidying variable names
At the moment, the car manufacturer names are all in lower case. It would look a lot nicer if they were in title case (i.e., with capitalisation on the first letter of each word). We can use the mutate()
function to create a new column - this time, the name of the new column is also the name of the old column that we’re wanting to modify using the function str_to_title()
. What this will do is overwrite the column manufacturer
and replace it with the new version with the car manufacturer names in title case.
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 Audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 Audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 Audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 Audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 Audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 Audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 Audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 Audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 Audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 Audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
The column model
is also lowercase. Let’s make that title case too. We can use the mutate()
function to work over more than one column at the same time like this:
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 Audi A4 1.8 1999 4 auto… f 18 29 p comp…
## 2 Audi A4 1.8 1999 4 manu… f 21 29 p comp…
## 3 Audi A4 2 2008 4 manu… f 20 31 p comp…
## 4 Audi A4 2 2008 4 auto… f 21 30 p comp…
## 5 Audi A4 2.8 1999 6 auto… f 16 26 p comp…
## 6 Audi A4 2.8 1999 6 manu… f 18 26 p comp…
## 7 Audi A4 3.1 2008 6 auto… f 18 27 p comp…
## 8 Audi A4 Quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 Audi A4 Quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 Audi A4 Quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
There are quite a few columns there, so how about we select just the manufacturer, model, year, transmission, and hwy columns:
mpg %>%
mutate(manufacturer = str_to_title(manufacturer), model = str_to_title(model)) %>%
select(manufacturer, model, year, trans, hwy)
## # A tibble: 234 × 5
## manufacturer model year trans hwy
## <chr> <chr> <int> <chr> <int>
## 1 Audi A4 1999 auto(l5) 29
## 2 Audi A4 1999 manual(m5) 29
## 3 Audi A4 2008 manual(m6) 31
## 4 Audi A4 2008 auto(av) 30
## 5 Audi A4 1999 auto(l5) 26
## 6 Audi A4 1999 manual(m5) 26
## 7 Audi A4 2008 auto(av) 27
## 8 Audi A4 Quattro 1999 manual(m5) 26
## 9 Audi A4 Quattro 1999 auto(l5) 25
## 10 Audi A4 Quattro 2008 manual(m6) 28
## # ℹ 224 more rows
Recoding variables
In the real world, data frames do not always arrive on our computer in tidy format. Very often you need to engage in some data tidying before you can do anything useful with them. We’re going to look at an example of how we go from messy data to tidy data.
my_messy_data <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/my_data.csv")
We ran a reaction time experiment with 24 participants and 4 conditions - they are numbered 1-4 in our data file.
## # A tibble: 6 × 3
## participant condition rt
## <dbl> <dbl> <dbl>
## 1 1 1 879
## 2 1 2 1027
## 3 1 3 1108
## 4 1 4 765
## 5 2 1 1042
## 6 2 2 1050
This is a repeated measures design where we had one factor (Prime Type) with two levels (A vs. B) and a second factor (Target Type) with two levels (A vs. B). We want to recode our data frame so it better matches our experimental design. First we need to recode our 4 conditions like this:
Recode condition columns follows: Condition 1 = Prime A, Target A Condition 2 = Prime A, Target B Condition 3 = Prime B, Target A Condition 4 = Prime B, Target B
my_messy_data %>%
mutate(condition = recode(condition,
"1" = "PrimeA_TargetA",
"2" = "PrimeA_TargetB",
"3" = "PrimeB_TargetA",
"4" = "PrimeB_TargetB")) %>%
head()
## # A tibble: 6 × 3
## participant condition rt
## <dbl> <chr> <dbl>
## 1 1 PrimeA_TargetA 879
## 2 1 PrimeA_TargetB 1027
## 3 1 PrimeB_TargetA 1108
## 4 1 PrimeB_TargetB 765
## 5 2 PrimeA_TargetA 1042
## 6 2 PrimeA_TargetB 1050
We now need to separate out our Condition column into two - one for our first factor (Prime), and one for our second factor (Target). The separate()
function does just this - when used in conjunction with a piped tibble, it needs to know which column we want to separate, what new columns to create by separating that original column, and on what basis we want to do the separation. In the example below we tell separate()
that we want to separate the column labeled condition
into two new columns called Prime
and Target
and we want to do this at any points where a _
is present in the column to be separated.
my_messy_data %>%
mutate(condition = recode(condition,
"1" = "PrimeA_TargetA",
"2" = "PrimeA_TargetB",
"3" = "PrimeB_TargetA",
"4" = "PrimeB_TargetB")) %>%
separate(col = "condition", into = c("Prime", "Target"), sep = "_")
## # A tibble: 96 × 4
## participant Prime Target rt
## <dbl> <chr> <chr> <dbl>
## 1 1 PrimeA TargetA 879
## 2 1 PrimeA TargetB 1027
## 3 1 PrimeB TargetA 1108
## 4 1 PrimeB TargetB 765
## 5 2 PrimeA TargetA 1042
## 6 2 PrimeA TargetB 1050
## 7 2 PrimeB TargetA 942
## 8 2 PrimeB TargetB 945
## 9 3 PrimeA TargetA 943
## 10 3 PrimeA TargetB 910
## # ℹ 86 more rows
my_messy_data %>%
mutate(condition = recode(condition,
"1" = "PrimeA_TargetA",
"2" = "PrimeA_TargetB",
"3" = "PrimeB_TargetA",
"4" = "PrimeB_TargetB")) %>%
separate(col = "condition", into = c("Prime", "Target"), sep = "_") %>%
mutate(Prime = factor(Prime), Target = factor(Target))
## # A tibble: 96 × 4
## participant Prime Target rt
## <dbl> <fct> <fct> <dbl>
## 1 1 PrimeA TargetA 879
## 2 1 PrimeA TargetB 1027
## 3 1 PrimeB TargetA 1108
## 4 1 PrimeB TargetB 765
## 5 2 PrimeA TargetA 1042
## 6 2 PrimeA TargetB 1050
## 7 2 PrimeB TargetA 942
## 8 2 PrimeB TargetB 945
## 9 3 PrimeA TargetA 943
## 10 3 PrimeA TargetB 910
## # ℹ 86 more rows
The pivot functions
Most of the analysis we will conduct in R requires our data to be in tidy, or long, format. In such data sets, one row corresponds to one observation. In the real world, data are rarely in the right format for analysis. In R, the pivot_wider()
and pivot_longer()
functions are designed to reshape our data files. First, let’s load a datafile that is in wide format (i.e., multiple observations per row). It is from an experiment where we had four conditions (labelled Condition1, Condition2, Condition3, and Condition4). In addition to there being a column for each of the 4 conditions, we also have a column corresponding to participant ID. Each cell in the data set corresponds to a reaction time (measured in milliseconds).
my_wide_data <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/my_wide_data.csv")
The pivot_longer()
function
## # A tibble: 6 × 5
## ID Condition1 Condition2 Condition3 Condition4
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 487 492 499 488
## 2 2 502 494 517 508
## 3 3 510 483 488 509
## 4 4 476 488 513 521
## 5 5 504 478 513 504
## 6 6 505 486 503 495
So, we can see the data file is in wide format. We want to reshape it to long format. We can do that using the pivot_longer()
function.
Minimally, we need to specify the data frame that we want to reshape, the columns that we want to ‘pivot’ into longer format, the name of the new column that we are creating, and the name of the column that will hold the values of our reshaped data frame. We are going to map the output to a variable I’m calling my_longer_data
.
my_longer_data <- my_wide_data %>%
pivot_longer(cols = c(Condition1, Condition2, Condition3, Condition4),
names_to = "Condition",
values_to = "RT")
Now let’s have a look at what our reshaped data frame looks like.
## # A tibble: 6 × 3
## ID Condition RT
## <dbl> <chr> <dbl>
## 1 1 Condition1 487
## 2 1 Condition2 492
## 3 1 Condition3 499
## 4 1 Condition4 488
## 5 2 Condition1 502
## 6 2 Condition2 494
So you can see our data are now in long - or tidy - format with one observation per row. Note that our Condition
column isn’t coded as a factor. It’s important that our data set reflects the structure of our experiment so let’s convert that column to a factor - note that in the following code we are now ‘saving’ the change as we are not mapping the output onto a variable name.
## # A tibble: 6 × 3
## ID Condition RT
## <dbl> <fct> <dbl>
## 1 1 Condition1 487
## 2 1 Condition2 492
## 3 1 Condition3 499
## 4 1 Condition4 488
## 5 2 Condition1 502
## 6 2 Condition2 494
The pivot_wider()
function
We can use the pivot_wider()
function to reshape a long data frame so that it goes from long to wide format. It works similarly to pivot_longer()
. Let’s take our new, long, data frame and turn it back into wide format. With pivot_wider()
we minimally need to specify the data frame that we want to reshape, and a pair or arguments (names_from and values_from) that describe from which column to get the name of the output column, and from which column to get the cell values.
We can check that our data set is back in wide format.
## # A tibble: 6 × 5
## ID Condition1 Condition2 Condition3 Condition4
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 487 492 499 488
## 2 2 502 494 517 508
## 3 3 510 483 488 509
## 4 4 476 488 513 521
## 5 5 504 478 513 504
## 6 6 505 486 503 495
3.1.7 Joining Two Datasets
Sometimes you might need to combine two datasets. For example, you might have one dataset that contains reading time data (like the one above) and another than contains individual difference measures for the participants in the first dataset. How would we go about combining these two datasets so that we end up with one that includes both the reading time data and the individual difference measures (that perhaps we want to covary out later)? Luckily, the dplyr
package contains a number of join functions that allows you to join together different tibbles. First, let’s load the data that contains the individual different measures.
individual_diffs <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/individual_diffs.csv")
Let’s look at the first few rows of the individual differences data. This dataset contains the ID numbers of our participants plus measures of IQ (the iq column) and Working Memory (the wm column).
## # A tibble: 6 × 3
## ID iq wm
## <dbl> <dbl> <dbl>
## 1 1 100 9
## 2 2 108 8
## 3 3 116 9
## 4 4 95 9
## 5 5 83 11
## 6 6 73 10
We want to combine this dataset with our reading time dataset from above my_longer_data
which looks like this:
## # A tibble: 6 × 3
## ID Condition RT
## <dbl> <chr> <dbl>
## 1 1 Condition1 487
## 2 1 Condition2 492
## 3 1 Condition3 499
## 4 1 Condition4 488
## 5 2 Condition1 502
## 6 2 Condition2 494
Full Join
We can combine using one of the join functions. There are a variety of options including full_join()
which includes all of the rows from both tibbles that we want to join. Other options include inner_join()
which keeps only the rows in tibble one that have a matching key in tibble 2, as well as left_join()
and right_join()
.
We now see that our dataset are combined as we’d expect.
## # A tibble: 128 × 5
## ID Condition RT iq wm
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Condition1 487 100 9
## 2 1 Condition2 492 100 9
## 3 1 Condition3 499 100 9
## 4 1 Condition4 488 100 9
## 5 2 Condition1 502 108 8
## 6 2 Condition2 494 108 8
## 7 2 Condition3 517 108 8
## 8 2 Condition4 508 108 8
## 9 3 Condition1 510 116 9
## 10 3 Condition2 483 116 9
## # ℹ 118 more rows
Left Join
Of course, you may be thinking that we could just do a quick bit of Excel cut and paste of the columns we want from one dataset to the other. But what about the case where our individual differences file contains 10,000 participant IDs (in random order) and we’re only interested in combining the two datasets where there is a match?
large_ind_diffs <- read_csv("https://raw.githubusercontent.com/ajstewartlang/03_data_wrangling/master/data/large_ind_diffs.csv")
## # A tibble: 6 × 3
## ID iq wm
## <dbl> <dbl> <dbl>
## 1 6057 93 7
## 2 2723 86 7
## 3 1088 97 9
## 4 8687 87 8
## 5 4223 77 11
## 6 369 95 9
We can actually use another join function (left_join()
) to combine these two datasets, but only where there is a match of ID with the first of the two datasets (my_longer_data
) in the function call.
## # A tibble: 128 × 5
## ID Condition RT iq wm
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Condition1 487 100 9
## 2 1 Condition2 492 100 9
## 3 1 Condition3 499 100 9
## 4 1 Condition4 488 100 9
## 5 2 Condition1 502 108 8
## 6 2 Condition2 494 108 8
## 7 2 Condition3 517 108 8
## 8 2 Condition4 508 108 8
## 9 3 Condition1 510 116 9
## 10 3 Condition2 483 116 9
## # ℹ 118 more rows
3.1.8 Your challenge (complete this before the in-person session)
Have a go at recreating the join above using RStudio Desktop. Use read_csv
to create the my_wide_data
tibble (as shown above) and use pivot_longer
to make my my_longer_data
. Then use read_csv
to create large_ind_diffs
and use left_join
to combine it with my_longer_data
3.2 Summarising your data
We’ll be using the mpg
dataset that is built into the tidyverse
for this workshop. This dataset contains information about cars (such as engine size, fuel economy) produced by a number of different manufacturers. Once a dataset has been tidied, often one of the first things we want to do is generate summary statistics, e.g. the means and standard deviations for one of our variables grouped by car manufacturer. Have a look at the following video where I walk you through this worksheet. Then I want you to work through the content by writing (and running) the script on your own machine.
Remember to set up a .Rproj file for this workshop before continuing. In your script, you’ll first need to load the tidyverse.
3.2.1 Using group_by()
and summarise()
We are going to use the group_by()
function to group the dataset, and then the summarise()
function to calculate the mean and sd of the hwy
variable. The summarise()
function can take a lot of different functions to give us summary statistics. To read more about the different options, type ?summarise
in the Console window. Commonly used ones are mean()
, median()
, sd()
.
## # A tibble: 15 × 4
## manufacturer mean_hwy sd_hwy number
## <chr> <dbl> <dbl> <int>
## 1 audi 26.4 2.18 18
## 2 chevrolet 21.9 5.11 19
## 3 dodge 17.9 3.57 37
## 4 ford 19.4 3.33 25
## 5 honda 32.6 2.55 9
## 6 hyundai 26.9 2.18 14
## 7 jeep 17.6 3.25 8
## 8 land rover 16.5 1.73 4
## 9 lincoln 17 1 3
## 10 mercury 18 1.15 4
## 11 nissan 24.6 5.09 13
## 12 pontiac 26.4 1.14 5
## 13 subaru 25.6 1.16 14
## 14 toyota 24.9 6.17 34
## 15 volkswagen 29.2 5.32 27
Note that this output is currently ordered alphabetically by the first column manufacturer
. What if we wanted to order this out by mean highway fuel economy highest (best) to lowest (worst)? We can use the arrange function.
Re-ordering the output with arrange()
mpg %>%
group_by(manufacturer) %>%
summarise(mean_hwy = mean(hwy), sd_hwy = sd(hwy), number = n()) %>%
arrange(mean_hwy)
## # A tibble: 15 × 4
## manufacturer mean_hwy sd_hwy number
## <chr> <dbl> <dbl> <int>
## 1 land rover 16.5 1.73 4
## 2 lincoln 17 1 3
## 3 jeep 17.6 3.25 8
## 4 dodge 17.9 3.57 37
## 5 mercury 18 1.15 4
## 6 ford 19.4 3.33 25
## 7 chevrolet 21.9 5.11 19
## 8 nissan 24.6 5.09 13
## 9 toyota 24.9 6.17 34
## 10 subaru 25.6 1.16 14
## 11 pontiac 26.4 1.14 5
## 12 audi 26.4 2.18 18
## 13 hyundai 26.9 2.18 14
## 14 volkswagen 29.2 5.32 27
## 15 honda 32.6 2.55 9
Hmm, so that isn’t what we want - this is going from lowest to highest which is the default in R. We can change that by putting a -
sign in from of the parameter we can to order by.
mpg %>%
group_by(manufacturer) %>%
summarise(mean_hwy = mean(hwy), sd_hwy = sd(hwy), number = n()) %>%
arrange(-mean_hwy)
## # A tibble: 15 × 4
## manufacturer mean_hwy sd_hwy number
## <chr> <dbl> <dbl> <int>
## 1 honda 32.6 2.55 9
## 2 volkswagen 29.2 5.32 27
## 3 hyundai 26.9 2.18 14
## 4 audi 26.4 2.18 18
## 5 pontiac 26.4 1.14 5
## 6 subaru 25.6 1.16 14
## 7 toyota 24.9 6.17 34
## 8 nissan 24.6 5.09 13
## 9 chevrolet 21.9 5.11 19
## 10 ford 19.4 3.33 25
## 11 mercury 18 1.15 4
## 12 dodge 17.9 3.57 37
## 13 jeep 17.6 3.25 8
## 14 lincoln 17 1 3
## 15 land rover 16.5 1.73 4
This is looking better.
The summarise_at()
variant
As well as using summarise()
, you can use related functions such as summarise_at()
. This is a scoped version of the summarise()
function that can be applied across multiple columns. Note, when using summarise_at()
you need to put the columns you want to summarise over in quotes. You also need to provide the summary function - in this case mean
. Finally, in case our dataset contains any missing values (indicated by NA
), we set the parameter na.rm = TRUE
. This will ensure that missing data points are removed before the operation is applied. If we had missing data, but didn’t tell R what we wanted to do with it, it would have thrown an error.
## # A tibble: 15 × 4
## manufacturer displ cty hwy
## <chr> <dbl> <dbl> <dbl>
## 1 audi 2.54 17.6 26.4
## 2 chevrolet 5.06 15 21.9
## 3 dodge 4.38 13.1 17.9
## 4 ford 4.54 14 19.4
## 5 honda 1.71 24.4 32.6
## 6 hyundai 2.43 18.6 26.9
## 7 jeep 4.58 13.5 17.6
## 8 land rover 4.3 11.5 16.5
## 9 lincoln 5.4 11.3 17
## 10 mercury 4.4 13.2 18
## 11 nissan 3.27 18.1 24.6
## 12 pontiac 3.96 17 26.4
## 13 subaru 2.46 19.3 25.6
## 14 toyota 2.95 18.5 24.9
## 15 volkswagen 2.26 20.9 29.2
The summarise_if()
variant
Imagine we had a really big dataset and wanted to summarise all columns that were of a certain type. We can use the summarise_if()
function to work out the mean for each of our car manufactures as follows:
## # A tibble: 15 × 6
## manufacturer displ year cyl cty hwy
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 audi 2.54 2004. 5.22 17.6 26.4
## 2 chevrolet 5.06 2005. 7.26 15 21.9
## 3 dodge 4.38 2004. 7.08 13.1 17.9
## 4 ford 4.54 2003. 7.2 14 19.4
## 5 honda 1.71 2003 4 24.4 32.6
## 6 hyundai 2.43 2004. 4.86 18.6 26.9
## 7 jeep 4.58 2006. 7.25 13.5 17.6
## 8 land rover 4.3 2004. 8 11.5 16.5
## 9 lincoln 5.4 2002 8 11.3 17
## 10 mercury 4.4 2004. 7 13.2 18
## 11 nissan 3.27 2004. 5.54 18.1 24.6
## 12 pontiac 3.96 2003. 6.4 17 26.4
## 13 subaru 2.46 2004. 4 19.3 25.6
## 14 toyota 2.95 2003. 5.12 18.5 24.9
## 15 volkswagen 2.26 2003. 4.59 20.9 29.2
The first parameter in summarise_if()
is the logical test applied to each column - in this case, if a column is numeric (i.e., a number) - then the test evaluates to TRUE and the second function, mean
, is applied. Again, we tell R to ignore missing (NA
) data values with the na.rm = TRUE
parameter.
R functions differ in terms of what arguments they take. I often forget them - if you start typing a function name, you’ll get a little bubble above where you’re typing to remind you what parameters are needed. And if you can’t remember the details, you can just type help(function_name)
or ?function_name
in the console for any function that you need help with. A lot of data analysis with R (or Python or any other language really) involves a fair bit of Googling. This is normal. There are some things I can never remember and am always having to look up!
3.2.2 Adding columns using mutate()
We can add a new column that I’m calling mean_hwy
using the mutate()
function like this.
## # A tibble: 234 × 13
## # Groups: manufacturer [15]
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
## # ℹ 2 more variables: mean_hwy <dbl>, sd_hwy <dbl>
We have too many columns to display on this page so we can drop a couple by using the select()
function slightly differently. By putting a -
sign in front of a column names in select()
we end up dropping it.
mpg %>%
group_by(manufacturer) %>%
mutate(mean_hwy = mean(hwy), sd_hwy = sd(hwy)) %>%
select(-class, -trans)
## # A tibble: 234 × 11
## # Groups: manufacturer [15]
## manufacturer model displ year cyl drv cty hwy fl mean_hwy sd_hwy
## <chr> <chr> <dbl> <int> <int> <chr> <int> <int> <chr> <dbl> <dbl>
## 1 audi a4 1.8 1999 4 f 18 29 p 26.4 2.18
## 2 audi a4 1.8 1999 4 f 21 29 p 26.4 2.18
## 3 audi a4 2 2008 4 f 20 31 p 26.4 2.18
## 4 audi a4 2 2008 4 f 21 30 p 26.4 2.18
## 5 audi a4 2.8 1999 6 f 16 26 p 26.4 2.18
## 6 audi a4 2.8 1999 6 f 18 26 p 26.4 2.18
## 7 audi a4 3.1 2008 6 f 18 27 p 26.4 2.18
## 8 audi a4 qu… 1.8 1999 4 4 18 26 p 26.4 2.18
## 9 audi a4 qu… 1.8 1999 4 4 16 25 p 26.4 2.18
## 10 audi a4 qu… 2 2008 4 4 20 28 p 26.4 2.18
## # ℹ 224 more rows
Note that this doesn’t change the mpg
dataset permanently - the changes won’t be saved unless we map the output of this code onto a new variable. Below I am doing this by using the assignment operator <-
to map it onto a new variable I’m calling mpg_with_mean
. Note that we remove the grouping at the end as we don’t want our grouping rule to remain in our new data frame.
mpg_with_mean <- mpg %>%
group_by(manufacturer) %>%
mutate(mean_hwy = mean(hwy), sd_hyw = sd(hwy)) %>%
ungroup() %>%
select(-class, -trans)
We can then inspect this new variable using head()
and str()
.
## # A tibble: 6 × 11
## manufacturer model displ year cyl drv cty hwy fl mean_hwy sd_hyw
## <chr> <chr> <dbl> <int> <int> <chr> <int> <int> <chr> <dbl> <dbl>
## 1 audi a4 1.8 1999 4 f 18 29 p 26.4 2.18
## 2 audi a4 1.8 1999 4 f 21 29 p 26.4 2.18
## 3 audi a4 2 2008 4 f 20 31 p 26.4 2.18
## 4 audi a4 2 2008 4 f 21 30 p 26.4 2.18
## 5 audi a4 2.8 1999 6 f 16 26 p 26.4 2.18
## 6 audi a4 2.8 1999 6 f 18 26 p 26.4 2.18
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ mean_hwy : num [1:234] 26.4 26.4 26.4 26.4 26.4 ...
## $ sd_hyw : num [1:234] 2.18 2.18 2.18 2.18 2.18 ...
3.2.3 Your challenge (do this during the in-person session)
The Tidyverse
has a number of other built-in data sets. Another one is the starwars
data set. You can have a look at it by typing starwars
or by typing view(starwars)
. This second option will open the data set in a new window. Have a go playing around with it. Work out the mean height of humans in the Star Wars universe. There might be some missing data (indicated by NA
). You can use the na.rm = TRUE
parameter in your summarise()
function to ignore these values when generating your summary statistics.
Another way to filter out NA
values is to use the filter()
function in your pipeline. The function is.na()
returns a logical value of TRUE of FALSE. The operator !
means NOT, so the expression !is.na(height)
will return TRUE when the height value is present, and FALSE if absent. By combining this with filter()
we have the line filter(!is.na(height))
which will filter only the cases where we have height data (i.e., !is.na(height)
is TRUE). So your code might look like this:
starwars %>%
filter(!is.na(height)) %>%
filter(species == "Human") %>%
summarise(mean_height = mean(height))
Replace the word mean
in the summarise()
line with median
. What other things can you replace it with? Hint: type ?summarise
in the console. What other summary information can you extract from this dataset?