Working with Data Frames

Introduction

This is also known as data munging or data wrangling and is one of the most common processes in any bioinformatics or statistical analysis.

We’ll cover:

  • SQL- and Excel-like functions in dplyr
    • select(): selecting and rearranging columns
    • rename(): renaming columns
    • mutate(): creating and overwriting columns
    • arrange(): reordering columns
    • summarise(): obtaining summary values for a column
    • group_by(): defining grouping variables for summaries
  • Changing from wide to long form using tidyr
    • pivot_longer(): changing from wide to long form
    • pivot_wider(): changing from long to wide form
  • Editing text using stringr
    • str_extract_all(): using regular expressions to extract patterns within strings

The datasets required are contained in the files 1) transport.csv and 2) pcr.csv Create a new R Script called DataWrangling.R

The package dplyr

Starting with dplyr

This package is also loaded by default when we enter library(tidyverse). For this session we’ll use the same data as for the previous session. Note that now we’ll define it as an R object and then we’ll pass it to read_csv(). To check that we have this correct, we’ll first check that the file exists, using the appropriately named function file.exists().

library(tidyverse)
csvFile <- file.path("~", "data", "intro_r", "transport.csv")
file.exists(csvFile)
data <- read_csv(csvFile)
data

So we have a 80 x 6 data frame, which has the familiar structure of rows and columns, and may look a little like a spreadsheet in Excel. We can check the dimensions of this object.

dim(data)
nrow(data)
ncol(data)

Our tasks will be to:

  • remove any meaningless columns
  • count how many of each gender use each transport method
  • add a column with BMI

Sub-setting a data.frame

The conventional R method of sub-setting a data.frame is to use the square brackets ([]) and specify rows/columns by position or by name.

For example, to get back the first value (i.e., row) of the second column we could type:

data[1, 2]

Or we could get back the first 5 names

data[1:5, "name"]

The package dplyr has an additional method of selecting columns, by using the function select().

select(data, name, transport)

Note that in the above, we have only printed the results to the screen and haven’t created any new R objects.

An advantage of this method, as opposed to the square brackets is the use of some helper functions starts_with(), ends_with(), contains() and everything(). We can use these to reorder the columns easily. Try the following, and you’ll see how useful this can be.

select(data, name, everything())
select(data, ends_with("ght"))
select(data, contains("a"))

We can also use the - sign before a name to remove columns.

select(data, -name)

It looks like the frst column (X1) is actually just the rownames written to the file when saved by our collaborators. Let’s remove that column, and this time we’ll over-write the object. If you haven’t been keeping track of the above in your script, this would be an important line to put there as we are overwriting the original object. If we delete a column and then realise we need it back , we can just rerun our code to reload the object.

data <- select(data, -1)

This is an alternate method to skipping the column when importing using read_csv(). In the function select(), we can also use column numbers but in general it is best practice to use column names.

Why would using column names be better than numbers Using column names is much easier to read back, and to communicate with others. We immediately know what we’ve done, whereas if we use numbers, we would have to reload the object, then look to see what columns correspond to the given numbers. Given that you’ll often be looking back at your own code, this is a very good habit to get into, and you will be making your life easier.

Adding extra columns

We can add extra columns using mutate(). It’s clear that the height here is provided in cm, but for BMI calculations we would need height in m. We can add a column and perform calculations add the same time.

mutate(data, height_m = height/100)

We can also perform multiple column additions in the same step.

mutate(data, height_m = height/100, BMI = weight / height_m^2)

Note that we haven’t overwritten our original object, so let’s add those columns permanently to our data.frame. Don’t forget to add this line to your script! Add a comment explaining what you’ve done if you like. It’s probably a good idea for reading your code back.

data <- mutate(data, height_m = height/100, BMI = weight / height_m^2)

Renaming columns

Now we have a column called height and another called height_m so it might be sensible to rename our original column as height_cm to make this clear. The function rename() is all we need to perform this action. This time we’ll just overwrite with the new version of data straight away.

data <- rename(data, height_cm = height)
data

Filtering data

Many of us are familiar with the Auto-filter in Excel, and dplyr has equivalent function, called filter(). We can use logical tests to build up complex filtering criteria. Again, note that in the next few lines, we are not overwriting our object but are just exploring our data.

filter(data, gender == "male")

Notice that in the above line we used the double equals (==) sign. This is common syntax in most programming languages, whilst a single equals sign (=) usually means we are assigning a value to an object or variable. To perform the test not equal to, we replace the first equals sign with an exclamation mark (!).

filter(data, gender != "male")

We can build up complex filters by adding them inside the filter function with a comma between them.

filter(data, gender == "male", height_cm > 175)
filter(data, transport == "car", gender == "female")

Sorting columns

We can also use the function arrange() to sort our data, just like the sort function in Excel. By default, values are sorted in ascending order, so to reverse this we just wrap the column name in desc().

arrange(data, weight)
arrange(data, desc(weight))

We can also sort on multiple columns in the same line.

arrange(data, transport, height_cm)

Combining Functions using %>%

This is where dplyr steps up a gear. We can chain functions together using the symbol %>%, which behaves like a pipe symbol (|) in the bash shell. This function/symbol is called The Magrittr or The Pipe after a famous painting by Rene Magritte It is contained in the package magrittr, and is loaded by dplyr every time we load dplyr.

The %>% symbol places whatever precedes it as the first argument of the next function. We can think of the %>% symbol as saying then. As in, do the first step then do the next step, and so on. An alternative way to write our previous line of code would be:

data %>% arrange(transport, height_cm)

We can now use this to combine functions into a longer chain of commands, and there is no limit to the number of functions you can chain together.

data %>% filter(transport == "bike") %>% arrange(weight)

This specifically works here because every function takes a data.frame as input, and provides a data.frame as output. From this point forward, we’ll use this as our standard syntax when using dplyr functions.

As we’re now able to build up long chains of commands, we can also spread our code over multiple lines. Ending a line with %>% symbol let’s R know that another line is coming. This has the dual advantage of being easier to read later, and enabling us to write comments at the end of each line.

data %>% # Take our original dataset
  filter(transport == "bike") %>%  # Find the cyclists
  arrange(weight) # Arrange by weight

Getting Summaries

We can get summaries for entire columns:

data %>% summarise(mean(weight), mean(height_cm))

Or we could chain together a few commands, and provide new names for our summary columns.

data %>%
  filter(gender == "female",
         transport == "bike") %>%
  summarise(max_BMI = max(BMI), 
            mn_height = mean(height_cm))

We can use any function here that returns a single value, such as min(), max(), mean(), sd(), median() etc.

As an additionally useful feature, we can group categorical variables by their levels, and even count how many measurements we have.

data %>%
  group_by(gender, transport) %>%
  summarise(mn_weight = mean(weight), 
            mn_height =mean(height_cm),
            mn_BMI = mean(BMI),
            n = n())

As you can see, this is very similar to Excel, except we don’t have to repeat anything if new data arrives. We just load the new file, and run our code. And we can remember exactly what we’ve done, without accidentally over-writing our original data. It’s all happened in R while our original file is still unchanged on our hard-drive. This can aid with reproducible research, and also help prevent catastrophes.

Reshaping Data

Wide Form Vs Long Form

In this section we will need the pcr.csv dataset in the data folder. Let’s load the data first by placing this at the beginning of your new script file.

library(tidyverse)
pcrFile <- file.path("~", "data", "intro_r", "pcr.csv")
file.exists(pcrFile)
pcrData <- read_csv(pcrFile)

Here we have RT-PCR data from T cells as a time course for two treatments (Resting and Stim). The values are Ct values and we have three genes under investigation.

This dataset is in what we refer to as wide form where we have a row of measurements for each individual gene. The information is structured around the gene. In long form, the information would be structured around the measurement.

To perform this in R we can simply use the package tidyr which is loaded with the call to library(tidyverse). The function we’ll use is pivot_longer(), which gathers multiple columns into a single longer column.

pcrData %>% pivot_longer(-Gene, names_to = "variable", values_to = "value")

Notice that this has held the Gene column fixed but the remainder of the columns have been ‘gathered’ down the screen. The column names are in a new column called variable, and the values are in a column called value. Now you can see that we have a unique row for each measurement, containing all the information about that measurement.

By adding the cols argument -Gene we are telling pivot_longer to ignore the Gene column when gathering up the data. An alternative method would be to use ends_with("hr") because all of the columns we want to retain end with the character string “hr”. Try it if you are interested.

Let’s save that as a new object, and give those columns better names while we’re at it.

pcrLong <- pcrData %>% pivot_longer(-Gene, names_to = "Treatment", values_to = "Ct")
head(pcrLong)

The important column here is called Treatment, but it actually contains data that should be placed in two columns. We can separate this column at the _ using the function separate().

pcrLong %>%
  separate(Treatment, into = c("Treatment", "Timepoint"), sep = "_")

If we wanted to return the ‘pcrLong’ data to wide form (before separating the original Treatment variable into two variables) we can use the function pivot_wider().

pcrLong %>% pivot_wider(names_from = "Treatment", values_from = "Ct")