As well as forming objects like we did in the introductory session, we often import pre-existing data into R from a spreadsheet, and use this to perform an analysis. Sometimes we have to manipulate & ‘wrangle’ the data a little and as part of this. We also make plots to see what everything looks like as a common part of this process.
Today we’ll start the ball rolling by introducing a set of packages
known as the tidyverse
. This is a core set of packages
produced by the RStudio team that make working with data as painless as
possible. A brief introduction to these packages and the general
approach is summarised below.
Like most programming languages R
is very strict about
data formats. We can import .xlsx
, xls
,
csv
, txt
, gtf/gff
files + many
more file types, but understanding the structure of the imported data is
often required. For Excel files, some things we do to make it “look
nice” in the Excel program can create problems, or are just ignored:
Before We Go On
We first need to create a new script for this section. Save and close the first script we used, then create a new one
File
> New File
>
R Script
(Or Ctrl++Alt+Shift+N
)DataImport.R
library(tidyverse)
Once you’ve entered this in your script, send it to the console using one of the methods we introduced earlier. This will load several packages which contains many useful functions for the following section. You should see something like the following, which tells us which individual packages have been loaded
── Attaching packages ────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.2 ✓ purrr 0.3.4
✓ tibble 3.0.3 ✓ dplyr 1.0.0
✓ tidyr 1.1.0 ✓ stringr 1.4.0
✓ readr 1.3.1 ✓ forcats 0.5.0
── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
This strategy of loading packages at the start of your script using
the command library(packageName)
, is a best practice
commonly used by R programmers. This will be helpful for any
collaborators, and for yourself when you look back at a script in 6
months.
Now we’ve gone through some formalities, let’s import some data using the Graphical User Interface (GUI) that comes with RStudio
~/data/intro_r
folder using the
Files
Tab
data
, then intro_r
transport.csv
View File
This can be used to open (small) files in the
Script Window
and view them as plain text files. When doing
this the file is not loaded into R. However, it can be
useful just to have a quick look at what the file contains. Here we can
see that we have a series of comma-separated entries, which is what we
expect for a .csv
file.
Now close the file again by clicking the cross next to the file name in the Script Window.
To import the file, or more accurately, to import the data within the
file into our R Environment
we can either:
Import Dataset
, orEnvironment
TabLet’s use the first of these methods, so click on the file and choose
Import Dataset
. This will open the Preview
Window as seen below. (If Data Preview
is empty,
click on the Update
button.)
As highlighted in the circle below, this gives us a preview of the
data as R
will see it. This is helpful to check that the
columns are as we expect. If there are problems with delimiters
(sometimes files can be labeled .csv
files but use a tab or
semi-colon as delimiters) we will spot the problem here. If the column
names are not as expected, we can also spot any problems here.
A very important frame in this preview is the code
highlighted in the bottom right corner. This is the
R
code that is being executed to import the data.
Highlight this code with your mouse/touchpad and copy it to your
clipboard.
Once you have copied this code, click Import
and the
data will pop open inside RStudio
. Before you go
any further, go to your DataImport.R
script and paste the
code directly below the line
library(tidyverse)
.
The code we copied has 3 lines:
library(readr)
readr
contains the function
read_csv()
library(tidyverse)
, so this line is a little redundant but
won’t hurt to include. Reloading a package generally does nothing.transport <- read_csv("~/data/intro_r/transport.csv")
R Environment
transport
by using the file
name without the suffix. (transport <-
)read_csv()
read_csv()
was then used to parse the
plain text into the R Environment
View(transport)
opened a preview in a familiar
Excel-like
format
R
object, and
only affect this previewtransport
in the Script Window
View(transport)
from your pasted codeWe have just loaded data using the default settings of
read_csv()
. (If we’d changed any options in the GUI, the
code would have changed.) Now we’ve saved the code in our
script, we no longer have any need the GUI to perform this operation
again.
~
symbol
~
symbol in the code above. This is
MacOS and Linux shorthand for your home directory. We’ll cover this in
more details once we get to our Bash Practicals, but is shorthand for
the directory /home/student
on your VM.
To demonstrate this:
Environment Tab
click the broom icon ()
R Environment
tidyverse
Environment Tab
again and
transport
is back!The idea that the source code is truth is at the very heart of Reproducible Research!!!
read_csv()
In the above we called the R
function
read_csv()
. To find out how the function works, we can
check the help page by typing a ?
followed by the function
name (no spaces)
?read_csv
This page lists four functions, but stick to read_csv()
This function has numerous arguments given with names like
file
, col_names
etc.
file
) or the function will not know what to do (i.e.,
there is no default value for the file
argument)col_names = TRUE
), the function uses the default
value and we don’t need to specify anything, unless over-riding the
defaults, or wishing to be explicit for anyone reading our code.arguments
for the function were defined somewhere
in the GUIFor example:
transport <- read_csv("~/data/intro_r/transport.csv")
Is equivalent to:
transport <- read_csv(file = "~/data/intro_r/transport.csv")
Setting the argument skip = 3
would then tell
read_csv()
to ignore the first 3 lines.
transport <- read_csv("path/to/file", skip = 3)
read_csv()
function can take an argument
trim_ws
. What do you think the argument
trim_ws
does?
read_csv()
When you get to the point of reading other people’s scripts, you may
sometimes come across another function read.csv()
. This is
the original function in base R, which is still in regular use, but is
inferior to read_csv()
. If you ask for help and we see
read.csv()
instead of read_csv()
, we will
definitely ask why you are trying to make your life harder!
The object transport
is a data.frame
object
(it’s actually a variant called a data_frame
or
tibble
which is a data.frame
with pretty
wrapping paper). This is the R
equivalent to a spreadsheet
and the main object type we’ll work with for the next two weeks. Three
possible ways to inspect this using the Console are:
View(transport)
transport
head(transport)
What were the differences between each method?
What if the data we have isn’t nice and clean?
This is where the additional packages from the
tidyverse
, such as dplyr
,
magrittr
, tidyr
, lubridate
and
stringr
become incredibly helpful.
By default read_csv()
assumes the names of the
variables/columns are in the first row (col_names = TRUE
).
This row is used to tell R
how many columns you have.
Let’s see what happens if we get this wrong, by loading a file without column names.
no_header <- read_csv("~/data/intro_r/no_header.csv")
no_header
We can easily fix this using the argument col_names
inside the function read_csv()
no_header <- read_csv("~/data/intro_r/no_header.csv", col_names = FALSE)
What about that first column?
That first column looks like it’s just row numbers, which are
unnecessary in R, as they are automatically included in
data.frame
(and tibble
) objects. We can
specify what is loaded or skipped using col_types
. Use the
help page for read_csv()
to try and understand what
"-ccnnc"
means.
no_header <- read_csv("~/data/intro_r/no_header.csv", col_names = FALSE, col_types = "-ccnnc")
Notice that we have specifically told R
what type of
values we have in each column. We never need to worry about this in
Excel, but it can be very useful in R
.
What if we get that wrong?
Let’s miss-specify the third column as a number instead of a text-string, i.e. “character”
no_header <- read_csv("~/data/intro_r/no_header.csv", col_names = FALSE, col_types = "-cnnnc")
Did the error message make any sense?
Did the file load?
What happened to the third column?
Many files have comments embedded within them, and these
often start with symbols like #
, >
or even
_
. These “comments” often contain important pieces of
information such as the genome build, or software versions. We can
specify the comment symbol in the argument comment =
which
tells read_csv()
to ignore lines which begin with these
characters. We’ll return to this idea later.
Let’s get it wrong first
comments <- read_csv("~/data/intro_r/comments.csv")
Did the data load as an R
object?
Now we can get it right by telling read_csv()
to
ignore any rows beginning with #
comments <- read_csv("~/data/intro_r/comments.csv", comment = "#")
What happens when you try to load the file
bad_colnames_.csv
?
bad_colnames <- read_csv("~/data/intro_r/bad_colnames.csv")
Here’s my solution, where I’ve set the column names
manually.
bad_colnames <- read_csv("~/data/intro_r/bad_colnames.csv", skip = 1, col_names = FALSE)
colnames(bad_colnames)
colnames(bad_colnames) <- c("rowname", "gender", "name", "weight", "height", "transport")
There are two important new concepts here.
colnames()
to call the column
names, or if we follow the function with the assignment operator
<-
, R
will place the subsequent data into
the column names of the data.frame
c()
, which is one
of the most common functions in R
. This stands for
combine
, and combines all values within the brackets
()
into a single R
object, or
vector
. If left empty, it is equivalent to a
NULL
or empty object.c()
We’ll see this function a great deal when using R
.
What if missing values have been set to “-” or some other value in our csv file?
Let’s get it wrong first:
missing_data <- read_csv("~/data/intro_r/missing_data.csv")
weight
column is no longer numeric, but appears to be a
column of text (characters)
Now we can get it right using both the na
and
col_types
arguments.
missing_data <- read_csv("~/data/intro_r/missing_data.csv", na = "-", col_types = "-ccnnc")