This is also known as data munging or data wrangling and is one of the most common processes in any bioinformatics or statistical analysis.
We’ll cover:

- SQL- and Excel-like functions in dplyr
  - select(): selecting and rearranging columns
  - rename(): renaming columns
  - mutate(): creating and overwriting columns
  - arrange(): reordering rows
  - summarise(): obtaining summary values for a column
  - group_by(): defining grouping variables for summaries
- tidyr
  - pivot_longer(): changing from wide to long form
  - pivot_wider(): changing from long to wide form
- stringr
  - str_extract_all(): using regular expressions to extract patterns within strings

The datasets required are contained in the files 1) transport.csv and 2) pcr.csv.
Create a new R script called DataWrangling.R.
dplyr
This package is also loaded by default when we enter library(tidyverse). For this session we’ll use the same data as for the previous session. Note that now we’ll define the file path as an R object and then pass it to read_csv(). To check that we have this correct, we’ll first check that the file exists, using the appropriately named function file.exists().
library(tidyverse)
csvFile <- file.path("~", "data", "intro_r", "transport.csv")
file.exists(csvFile)
data <- read_csv(csvFile)
data
So we have an 80 x 6 data frame, which has the familiar structure of rows and columns, and may look a little like a spreadsheet in Excel. We can check the dimensions of this object.
dim(data)
nrow(data)
ncol(data)
Our tasks will be to:

- remove the unneeded rownames column
- add new columns for height in metres and BMI
- rename, filter, sort and summarise the data
data.frame
The conventional R method of sub-setting a data.frame is to use the square brackets ([]) and specify rows/columns by position or by name. For example, to get back the first value (i.e., row) of the second column we could type:
data[1, 2]
Or we could get back the first 5 names:
data[1:5, "name"]
The package dplyr has an additional method of selecting columns, by using the function select().
select(data, name, transport)
Note that in the above, we have only printed the results to the screen and haven’t created any new R objects.
An advantage of this method, as opposed to the square brackets, is the use of some helper functions: starts_with(), ends_with(), contains() and everything(). We can use these to reorder the columns easily. Try the following, and you’ll see how useful this can be.
select(data, name, everything())
select(data, ends_with("ght"))
select(data, contains("a"))
We can also use the - sign before a name to remove columns.
select(data, -name)
It looks like the first column (X1) is actually just the rownames written to the file when saved by our collaborators. Let’s remove that column, and this time we’ll overwrite the object. If you haven’t been keeping track of the above in your script, this would be an important line to put there, as we are overwriting the original object. If we delete a column and then realise we need it back, we can just rerun our code to reload the object.
data <- select(data, -1)
This is an alternative to skipping the column when importing with read_csv(). In the function select(), we can also use column numbers, but in general it is best practice to use column names.
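As a side note, here is a minimal sketch of that import-time alternative, assuming a version of readr (2.0 or later) that provides the col_select argument:

# Drop the first column while importing, rather than removing it afterwards
read_csv(csvFile, col_select = -1)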
We can add extra columns using mutate(). It’s clear that the height here is provided in cm, but for BMI calculations we would need height in m. We can add a column and perform calculations at the same time.
mutate(data, height_m = height/100)
We can also perform multiple column additions in the same step.
mutate(data, height_m = height/100, BMI = weight / height_m^2)
Note that we haven’t overwritten our original object, so let’s add those columns permanently to our data.frame. Don’t forget to add this line to your script! Add a comment explaining what you’ve done if you like; it will probably help when reading your code back later.
data <- mutate(data, height_m = height/100, BMI = weight / height_m^2)
Now we have a column called height and another called height_m, so it might be sensible to rename our original column as height_cm to make this clear. The function rename() is all we need to perform this action. This time we’ll just overwrite with the new version of data straight away.
data <- rename(data, height_cm = height)
data
Many of us are familiar with the Auto-filter in Excel, and dplyr has an equivalent function, called filter(). We can use logical tests to build up complex filtering criteria. Again, note that in the next few lines, we are not overwriting our object but are just exploring our data.
filter(data, gender == "male")
Notice that in the above line we used the double equals sign (==). This is common syntax in most programming languages, whilst a single equals sign (=) usually means we are assigning a value to an object or variable. To perform the test not equal to, we replace the first equals sign with an exclamation mark (!).
filter(data, gender != "male")
We can build up complex filters by adding them inside the filter function with a comma between them.
filter(data, gender == "male", height_cm > 175)
filter(data, transport == "car", gender == "female")
We can also use the function arrange() to sort our data, just like the sort function in Excel. By default, values are sorted in ascending order, so to reverse this we just wrap the column name in desc().
arrange(data, weight)
arrange(data, desc(weight))
We can also sort on multiple columns in the same line.
arrange(data, transport, height_cm)
%>%
This is where dplyr steps up a gear. We can chain functions together using the symbol %>%, which behaves like the pipe symbol (|) in the bash shell. This function/symbol is called The Magrittr or The Pipe, after the painting The Treachery of Images (“Ceci n’est pas une pipe”) by René Magritte. It is contained in the package magrittr, and is loaded automatically every time we load dplyr.
The %>% symbol places whatever precedes it as the first argument of the next function. We can think of the %>% symbol as saying then. As in, do the first step, then do the next step, and so on. An alternative way to write our previous line of code would be:
data %>% arrange(transport, height_cm)
We can now use this to combine functions into a longer chain of commands, and there is no limit to the number of functions you can chain together.
data %>% filter(transport == "bike") %>% arrange(weight)
This specifically works here because every function takes a data.frame as input, and provides a data.frame as output. From this point forward, we’ll use this as our standard syntax when using dplyr functions.
As we’re now able to build up long chains of commands, we can also spread our code over multiple lines. Ending a line with the %>% symbol lets R know that another line is coming. This has the dual advantage of being easier to read later, and enabling us to write comments at the end of each line.
data %>% # Take our original dataset
filter(transport == "bike") %>% # Find the cyclists
arrange(weight) # Arrange by weight
We can get summaries for entire columns:
data %>% summarise(mean(weight), mean(height_cm))
Or we could chain together a few commands, and provide new names for our summary columns.
data %>%
  filter(gender == "female",
         transport == "bike") %>%
  summarise(max_BMI = max(BMI),
            mn_height = mean(height_cm))
We can use any function here that returns a single value, such as min(), max(), mean(), sd(), median() etc.
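For example, a quick sketch using two of these on the columns defined earlier:

data %>% summarise(min_weight = min(weight), sd_height = sd(height_cm))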
As an additionally useful feature, we can group categorical variables by their levels, and even count how many measurements we have.
data %>%
  group_by(gender, transport) %>%
  summarise(mn_weight = mean(weight),
            mn_height = mean(height_cm),
            mn_BMI = mean(BMI),
            n = n())
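As an aside, when the group sizes are all you need, dplyr also provides count() as a shortcut for group_by() followed by summarise(n = n()):

# Equivalent to group_by(gender, transport) %>% summarise(n = n())
data %>% count(gender, transport)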
As you can see, this is very similar to Excel, except we don’t have to repeat anything if new data arrives. We just load the new file and run our code. And we can remember exactly what we’ve done, without accidentally overwriting our original data. It’s all happened in R while our original file is still unchanged on our hard drive. This can aid with reproducible research, and also help prevent catastrophes.
In this section we will need the pcr.csv dataset in the data folder. Let’s load the data first by placing this at the beginning of your new script file.
library(tidyverse)
pcrFile <- file.path("~", "data", "intro_r", "pcr.csv")
file.exists(pcrFile)
pcrData <- read_csv(pcrFile)
Here we have RT-PCR data from T cells as a time course for two treatments (Resting and Stim). The values are Ct values and we have three genes under investigation.
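It’s worth a quick look at the object before reshaping it:

pcrData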
This dataset is in what we refer to as wide form, where we have a row of measurements for each individual gene; the information is structured around the gene. In long form, the information would be structured around the measurement.
To perform this conversion in R we can simply use the package tidyr, which is loaded with the call to library(tidyverse). The function we’ll use is pivot_longer(), which gathers a set of columns into two longer columns of names and values.
pcrData %>% pivot_longer(-Gene, names_to = "variable", values_to = "value")
Notice that this has held the Gene column fixed but the remainder of the columns have been ‘gathered’ down the screen. The column names are in a new column called variable, and the values are in a column called value. Now you can see that we have a unique row for each measurement, containing all the information about that measurement.
By passing -Gene as the cols argument we are telling pivot_longer() to ignore the Gene column when gathering up the data. An alternative method would be to use ends_with("hr"), because all of the columns we want to gather end with the character string “hr”. Try it if you are interested; a sketch is given below.
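If you do try it, the equivalent call would look something like this, assuming (as described above) that every timepoint column name ends in “hr”:

pcrData %>% pivot_longer(ends_with("hr"), names_to = "variable", values_to = "value")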
Let’s save that as a new object, and give those columns better names while we’re at it.
pcrLong <- pcrData %>% pivot_longer(-Gene, names_to = "Treatment", values_to = "Ct")
head(pcrLong)
The important column here is called Treatment, but it actually contains data that should be placed in two columns. We can separate this column at the _ using the function separate().
pcrLong %>%
  separate(Treatment, into = c("Treatment", "Timepoint"), sep = "_")
If we wanted to return the pcrLong data to wide form (before separating the original Treatment variable into two variables) we can use the function pivot_wider().
pcrLong %>% pivot_wider(names_from = "Treatment", values_from = "Ct")
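As a final sketch combining both ideas, we could separate the Treatment column first and then spread the data by timepoint instead, giving one column per timepoint:

pcrLong %>%
  separate(Treatment, into = c("Treatment", "Timepoint"), sep = "_") %>%
  pivot_wider(names_from = "Timepoint", values_from = "Ct")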