As with previous week’s practicals, you will be using RStudio to interact with your VM.
See week 1’s practical to remind yourself how to connect to your VM.
Today you are expected to be able to follow the instructions on your own. There will be a brief introduction by Dave, but after that you will need to work through the instructions. Try to work out for yourselves the answers to the questions before asking for help. If you are really stuck and can’t figure out how to get unstuck, ask for help.
First we will set up a directory for today’s practical. In general it is very worthwhile to keep all your project-specific code and data organised into a consistent location and structure. This is not essential, but is very useful and is in general good practice. If you don’t follow this step, you will be making your life immeasurably harder for the duration of this practical.
To make and enter the directory that you will be working in, run the following commands in the terminal pane.
# Setup project working directory
mkdir --parents ~/Project_3/data/
cd ~/Project_3/
You will use samtools
, bwa
and minimap2
today. minimap2 uses the conda
environment which you need to activate with the following command.
source activate bioinf
For the next couple of weeks we’re going to be using SARS-CoV-2 data, the virus causing COVID-19. We will be looking at this data in different ways, providing insights into how bioinformaticians analyses these types of data.
The SARS-CoV-2 genome assembly we will use is one of the earliest ones generated NC_045512.2.
This assembly was generated from a sample collected in Dec 2019 and submitted to NCBI on 13th Jan 2020 as NC_045512.1. Subsequently, NC_045512.1 was replaced by NC_045512.2 on the 17th Jan 2020.
Public sequence data is being released via NCBI’s SARS-CoV-2 page.
We will be looking at some data released on 21st Feb 2020 for a clinical swab obtained from a confirmed case in Madison, WI, USA. For further information see: https://openresearch.labkey.com/Coven/project-begin.view?
There is both Illumina and Oxford Nanopore data available for several samples.
We will be working with the SRR11140748
sample but you are welcome to also look at any of the other 3 samples if you have time.
Here is a table of information linking to the orginal source of the data:
Description | Accession/URL | Coverage |
---|---|---|
RefSeq | NC_045512.2 | |
Illumina | ||
veroSTAT-1KO | SRR11140744 | 7,586 |
veroE6 | SRR11140746 | 5,328 |
vero76 | SRR11140748 | 6,353 |
swab | SRR11140750 | 252 |
Nanopore | ||
veroSTAT-1KO | SRR11140745 | 8,727 |
veroE6 | SRR11140747 | 12,006 |
vero76 | SRR11140749 | 10,170 |
swab | SRR11140751 | 792 |
We have also provided access to a subsample (~100x coverage) of the above data sets to facilitate a more timely analysis and improve data visualisation responsiveness.
# Make the directory for the Illumina PE reads
mkdir --parents ~/Project_3/data/illumina_pe/
# Get the Illumina data for the SRR11140748 sample
cp ~/data/SARS-CoV-2_Resequencing/SRR11140748_?_100x.fastq.gz ~/Project_3/data/illumina_pe/
?
is doing in the above command?Genome resequencing is performed when a good quality genome assembly already exists for a species. We generate sequence data, most commonly Illumina reads, for one or more samples and align them to the reference genome. We can then interrogate the read alignments and ask questions about how our sample differs from the reference genome and what the significance of those differences might be.
Using what you learned last week, run FastQC across the raw data for the SRR11140748
sample and inspect the FastQC reports.
Using what you learned last week and what you found in the FastQC reports generated above, processes the reads through trimmomatic
or fastp
if you deem it necessary.
Once you are happy the reads are of good quality, lets see how much data we have:
pigz --decompress --to-stdout --processes 2 \
~/Project_3/data/illumina_pe/SRR11140748_?_100x.fastq.gz \
| sed -n '2~4p' \
| awk 'BEGIN{OFS="\t"}{z+=length($0); y+=1}END{print z,y,z/y}'
pigz
command doing?sed
command is doing? Hint: try replacing the awk
command with a simple less
command.awk
command outputs a tab-delimited file consisting of 3 columns of numbers. What do the numbers represent?awk
script are the variables z
and y
. These are not very descriptive of what values they actually hold. Rewrite the script, making these variable names more self-explainitory.SRR11140748
sample?There are many tools for performing Illumina PE read alignment (aka mapping) against a genome. Today we will use a tool called BWA. Conceptually, a read aligner aims to find the most likely place in the genome from which the read was generated. Think of this as looking up a persons name in a phonebook.
Would it be easier to search for a name if:
Having the information stored in a certain way and having an index makes looking up names in the phonebook much faster. In the context of aligning reads to a reference genome, having an “index” of the reference genome speeds up the process of finiding and aligning reads to the genome. The index of a genome is often tool-specific so we often need to create a different index for each tool we plan to use.
Like sorting and indexing a phonebook, this process is time consuming. However, since it’s only done once, and it provides significant speed-ups in read alignment, it is time well spent!
Index the reference genome:
# Make the directory for the reference genome
mkdir --parents ~/Project_3/data/reference/
# Get the SARS-CoV-2 reference genome
cp ~/data/SARS-CoV-2_Resequencing/COVID-19.fasta.gz ~/Project_3/data/reference/
# Index the reference genome for use with BWA
bwa index ~/Project_3/data/reference/COVID-19.fasta.gz
bwa index
command? Hint: look at the contents of ~/Project_2/data/reference/
Align the Illumina PE reads to the reference genome:
# Create an output directory ahead of time
mkdir --parents ~/Project_3/resequencing/
# Align reads
time bwa mem \
~/Project_3/data/reference/COVID-19.fasta.gz \
~/Project_3/data/illumina_pe/SRR11140748_1_100x.fastq.gz \
~/Project_3/data/illumina_pe/SRR11140748_2_100x.fastq.gz \
> resequencing/SRR11140748_illumina.sam
# Coordinate sort SAM file, and output to BAM
time samtools sort \
-o resequencing/SRR11140748_illumina.bam \
resequencing/SRR11140748_illumina.sam
bwa mem
and samtools sort
commands have options for using multiple threads. Figure out what command line argument to use to change the number of threads to 2
.bwa mem
and samtools sort
commands in a pipeline so as to avoid writing the large intermediary SAM file?We are going to load the SARS-CoV-2 genome into a genome browser called IGV and then load the BAM file. First, we need to create an index for the reference genome and an index file of the BAM file. This ensures IGV can quickly load the reference sequence and corresponding read alignments for any coordinates we choose to navigate to.
# IGV-web requires an uncompressed reference sequence
# and corresponding index file
gunzip \
< ~/Project_3/data/reference/COVID-19.fasta.gz \
> ~/Project_3/data/reference/COVID-19.fasta
# Generate index of the FASTA-formatted genome file
samtools faidx ~/Project_3/data/reference/COVID-19.fasta
# Index the BAM file
samtools index resequencing/SRR11140748_illumina.bam
Download the following files to your local computer using RStudio’s File browser. Simply select 1 file at a time by checking the checkbox and click “More” » “Export…”. Click the “Download” button and save it somewhere obvious.
~/Project_3/data/reference/COVID-19.fasta
~/Project_3/data/reference/COVID-19.fasta.fai
~/Project_3/resequencing/SRR11140748_illumina.bam
~/Project_3/resequencing/SRR11140748_illumina.bam.bai
Visit, IGV-web and load the genome from a Local File ...
by selecting both the COVID-19.fasta
and COVID-19.fasta.fai
files.
Once the reference genome is loaded, load a “Track” from a Local File ...
by selecting both the SRR11140748_illumina.bam
and SRR11140748_illumina.bam.bai
files.
Long reads have the potential for spanning “repeat” regions, untangling the mess that short reads make in these areas, provide evidence of structural variation etc. Lets align the long reads to the reference genome.
# Activate the bioinf conda environment for minimap2
source activate bioinf
# Index the reference genome for minimap2
minimap2 \
-d ~/Project_3/data/reference/COVID-19.fasta.gz.mmi \
~/Project_3/data/reference/COVID-19.fasta.gz
# Get the Nanopore reads
mkdir -p ~/Project_3/data/nanopore/
cp ~/data/SARS-CoV-2_Resequencing/SRR11140749_1_100x.fastq.gz ~/Project_3/data/nanopore/
# Align the reads
time minimap2 \
-ax map-ont \
-t 2 \
~/Project_3/data/reference/COVID-19.fasta.gz.mmi \
~/Project_3/data/nanopore/SRR11140749_1_100x.fastq.gz \
| samtools sort --threads 2 -o resequencing/SRR11140749_nanopore.bam
# Index the BAM file
samtools index resequencing/SRR11140749_nanopore.bam
Load the nanopore read alignment file into your IGV-web.
NC_045512.2:17373
and NC_045512.2:20299
?