As with previous week’s practicals, you will be using RStudio to interact with your VM.
See week 1’s practical to remind yourself how to connect to your VM.
First we will set up a directory for today’s practical. In general it is very worthwhile to keep all your project-specific code and data organised into a consistent location and structure. This are not essential, but is very useful and is in general good practice. If you don’t follow this step, you will be making your life much harder for the duration of this practical.
To make and enter the directory that you will be working in, run the following commands in the terminal pane.
# Setup project working directory
mkdir --parents ~/Project_6/data/
cd ~/Project_6/
# Make the directory for the reference genome
mkdir --parents ~/Project_6/data/reference/
# Get the SARS-CoV-2 reference genome
cp ~/data/SARS-CoV-2_Resequencing/COVID-19.fasta.gz ~/Project_6/data/reference/
# Make the directory for the Illumina PE reads
mkdir --parents ~/Project_6/data/illumina_pe/
# Get the subsampled Illumina PE data
cp ~/data/SARS-CoV-2_Resequencing/SRR111407{44,46,48,50}_?_*x.fastq.gz ~/Project_6/data/illumina_pe/
# Make the directory for scripts
mkdir --parents ~/Project_6/data/scripts/
# Get the scripts
cp ~/data/SARS-CoV-2_Resequencing/plot_delta.R ~/Project_6/data/scripts/
Lets see what files and directories we have under our current working directory:
ls -lrthR
Today we will use spades
a deBruijn graph based assembler and mummer
a suffix tree based fast whole genome aligner.
{44,46,48,50}
mean/do in the above cp
command while grabbing the Illumina PE data?We’re going to be using some topical data, SARS-CoV-2 data, the virus causing COVID-19. We will be looking at this data in different ways, providing insights into how bioinformaticians analyse these types of data.
We will use the SARS-CoV-2 genome assembly NC_045512.2.
This assembly was generated from a sample collected in Dec 2019 and submitted to NCBI on 13th Jan 2020 as NC_045512.1. Subsequently, NC_045512.1 was replaced by NC_045512.2 on the 17th Jan 2020.
Public sequence data is being released via NCBI’s SARS-CoV-2 page.
We will be looking at some data released on 21st Feb 2020 for a clinical swab obtained from a confirmed case in Madison, WI, USA. For further information see: https://openresearch.labkey.com/Coven/project-begin.view?
We will be working with the SRR11140748
sample (Illumina data) but you are welcome to also look at any of the other 3 samples if you have time.
Here is a table of information linking to the orginal source of the data:
Description | Accession/URL | Coverage |
---|---|---|
RefSeq | NC_045512.2 | |
Illumina | ||
veroSTAT-1KO | SRR11140744 | 7,586 |
veroE6 | SRR11140746 | 5,328 |
vero76 | SRR11140748 | 6,353 |
swab | SRR11140750 | 252 |
Lets perform some de novo genome assemblies of Illumina reads.
I encourage you to explore the effects of using differing amounts of coverage of the input data on the assembly and explore the assembly of the other 3 samples.
We have made data files available for 10x
, 20x
, 40x
, 80x
and 100x
coverage of all 4 Illumina samples. You will note that we have used time
in our commands to find out how long the assemblies take. Since you will be exploring both the effect of coverage depth and k-mer
length it will be instructive for you to keep track of these times as you determine what gives you the most complete, contiguous assembly.
time spades.py \
--threads 2 \
-o ~/Project_6/de_novo_illumina/SRR11140748_10x_PE \
-1 ~/Project_6/data/illumina_pe/SRR11140748_1_10x.fastq.gz \
-2 ~/Project_6/data/illumina_pe/SRR11140748_2_10x.fastq.gz \
| tee ~/Project_6/de_novo_illumina/SRR11140748_10x_PE.log
tee
in the above pipeline; what did it do?Compare the assembly to the SARS-CoV-2 RefSeq assembly using MUMmer’s nucmer
tool:
# Decompress the reference genome for MUMmer
pigz -dcp2 \
< ~/Project_6/data/reference/COVID-19.fasta.gz \
> ~/Project_6/data/reference/COVID-19.fasta
# Run nucmer from MUMmer package
nucmer \
-maxmatch \
-minmatch 100 \
-mincluster 500 \
-prefix ~/Project_6/de_novo_illumina/SRR11140748_10x_PE \
~/Project_6/data/reference/COVID-19.fasta \
~/Project_6/de_novo_illumina/SRR11140748_10x_PE/contigs.fasta
This will generate a .delta
file which describes the alignments between your assembled contig(s) and the reference sequence(s).
While the file is plain text, it’s not meant for human consumption.
Instead, we need to use some other tools to visualise the information it contains.
The online tool Assemblytics can be used to generate useful plots of the information contained within a .delta
file.
Alternatively, your can use some R code to plot the information. We have provided an R script for this purpose, which can be executed as follows:
Rscript --vanilla \
~/Project_6/data/scripts/plot_delta.R \
~/Project_6/de_novo_illumina/SRR11140748_10x_PE.delta
This will create a PDF file called ~/Project_6/de_novo_illumina/SRR11140748_10x_PE.delta.pdf
SPAdes uses several k-mer lengths during the assembly process to produce a de Bruijn graph for each k-mer length and then combines the information to produce a final assembly.
Using what you learned above about exploring assemblies, lets look at how an assembly is affected by our choice of k-mer length. You may need to alter the mummer
minCluster
value to produce a valid .delta
file.
First, lets look at an assembly produced with a single short k-mer of length of 11:
k=11
time spades.py \
--threads 2 \
-k ${k} \
-o ~/Project_6/de_novo_illumina/SRR11140748_10x_PE-k${k} \
-1 ~/Project_6/data/illumina_pe/SRR11140748_1_10x.fastq.gz \
-2 ~/Project_6/data/illumina_pe/SRR11140748_2_10x.fastq.gz \
| tee ~/Project_6/de_novo_illumina/SRR11140748_10x_PE-k${k}.log
Next, lets look at an assembly produced with a long k-mer of length of 127:
k=127
time spades.py \
--threads 2 \
-k ${k} \
-o ~/Project_6/de_novo_illumina/SRR11140748_10x_PE-k${k} \
-1 ~/Project_6/data/illumina_pe/SRR11140748_1_10x.fastq.gz \
-2 ~/Project_6/data/illumina_pe/SRR11140748_2_10x.fastq.gz \
| tee ~/Project_6/de_novo_illumina/SRR11140748_10x_PE-k${k}.log
Now modify the above commands to produce another 2 assemblies but using the 100x
coverage data.
If you feel adventurous you can automate explorations of the effect of k
by writing a shell script using a for
loop similar to the example below.
#!/bin/bash
for k in 11 21 51 81 127
do
echo ${k}
done
exit 0