BIOINF-3010-7150

SARS-CoV-2 Short Read Assembly Practical

By Nathan Watson-Haigh, updated/adapted by Dave Adelson (2024)

As with previous week’s practicals, you will be using RStudio to interact with your VM.

See week 1’s practical to remind yourself how to connect to your VM.

Setup for today

Working Directory

First we will set up a directory for today’s practical. In general it is very worthwhile to keep all your project-specific code and data organised into a consistent location and structure. This are not essential, but is very useful and is in general good practice. If you don’t follow this step, you will be making your life much harder for the duration of this practical.

To make and enter the directory that you will be working in, run the following commands in the terminal pane.

# Setup project working directory
mkdir --parents ~/Project_6/data/
cd ~/Project_6/

Get the Data

# Make the directory for the reference genome
mkdir --parents ~/Project_6/data/reference/

# Get the SARS-CoV-2 reference genome

cp ~/data/SARS-CoV-2_Resequencing/COVID-19.fasta.gz ~/Project_6/data/reference/

# Make the directory for the Illumina PE reads
mkdir --parents ~/Project_6/data/illumina_pe/

# Get the subsampled Illumina PE data
cp ~/data/SARS-CoV-2_Resequencing/SRR111407{44,46,48,50}_?_*x.fastq.gz ~/Project_6/data/illumina_pe/

# Make the directory for scripts
mkdir --parents ~/Project_6/data/scripts/

# Get the scripts
cp ~/data/SARS-CoV-2_Resequencing/plot_delta.R ~/Project_6/data/scripts/

Lets see what files and directories we have under our current working directory:

ls -lrthR

Today’s software

Today we will use spades a deBruijn graph based assembler and mummer a suffix tree based fast whole genome aligner.

Questions

Data

We’re going to be using some topical data, SARS-CoV-2 data, the virus causing COVID-19. We will be looking at this data in different ways, providing insights into how bioinformaticians analyse these types of data.

RefSeq Genome Assembly

We will use the SARS-CoV-2 genome assembly NC_045512.2.

This assembly was generated from a sample collected in Dec 2019 and submitted to NCBI on 13th Jan 2020 as NC_045512.1. Subsequently, NC_045512.1 was replaced by NC_045512.2 on the 17th Jan 2020.

Public Sequencing Data

Public sequence data is being released via NCBI’s SARS-CoV-2 page.

We will be looking at some data released on 21st Feb 2020 for a clinical swab obtained from a confirmed case in Madison, WI, USA. For further information see: https://openresearch.labkey.com/Coven/project-begin.view?

We will be working with the SRR11140748 sample (Illumina data) but you are welcome to also look at any of the other 3 samples if you have time. Here is a table of information linking to the orginal source of the data:

Description Accession/URL Coverage
RefSeq NC_045512.2  
Illumina    
veroSTAT-1KO SRR11140744 7,586
veroE6 SRR11140746 5,328
vero76 SRR11140748 6,353
swab SRR11140750 252

Short Read Assembly

Initial Goals

  1. Assemble a genome using a de Bruijn graph assembler (SPAdes)
  2. Visualise assembly graphs
  3. Compare assemblies against an existing reference (MUMmer)

De Novo Genome Assembly

Lets perform some de novo genome assemblies of Illumina reads. I encourage you to explore the effects of using differing amounts of coverage of the input data on the assembly and explore the assembly of the other 3 samples. We have made data files available for 10x, 20x, 40x, 80x and 100x coverage of all 4 Illumina samples. You will note that we have used time in our commands to find out how long the assemblies take. Since you will be exploring both the effect of coverage depth and k-mer length it will be instructive for you to keep track of these times as you determine what gives you the most complete, contiguous assembly.

time spades.py \
  --threads 2 \
  -o ~/Project_6/de_novo_illumina/SRR11140748_10x_PE \
  -1 ~/Project_6/data/illumina_pe/SRR11140748_1_10x.fastq.gz \
  -2 ~/Project_6/data/illumina_pe/SRR11140748_2_10x.fastq.gz \
| tee ~/Project_6/de_novo_illumina/SRR11140748_10x_PE.log

Questions

Comparisons to Reference Genome

Compare the assembly to the SARS-CoV-2 RefSeq assembly using MUMmer’s nucmer tool:


# Decompress the reference genome for MUMmer
pigz -dcp2 \
  < ~/Project_6/data/reference/COVID-19.fasta.gz \
  > ~/Project_6/data/reference/COVID-19.fasta

# Run nucmer from MUMmer package
nucmer \
  -maxmatch \
  -minmatch 100 \
  -mincluster 500 \
  -prefix ~/Project_6/de_novo_illumina/SRR11140748_10x_PE \
  ~/Project_6/data/reference/COVID-19.fasta \
  ~/Project_6/de_novo_illumina/SRR11140748_10x_PE/contigs.fasta

This will generate a .delta file which describes the alignments between your assembled contig(s) and the reference sequence(s). While the file is plain text, it’s not meant for human consumption. Instead, we need to use some other tools to visualise the information it contains.

Visualisation of Delta Files

The online tool Assemblytics can be used to generate useful plots of the information contained within a .delta file.

Alternatively, your can use some R code to plot the information. We have provided an R script for this purpose, which can be executed as follows:

Rscript --vanilla \
  ~/Project_6/data/scripts/plot_delta.R \
  ~/Project_6/de_novo_illumina/SRR11140748_10x_PE.delta

This will create a PDF file called ~/Project_6/de_novo_illumina/SRR11140748_10x_PE.delta.pdf

Questions

Explore k-mer Length Effects

SPAdes uses several k-mer lengths during the assembly process to produce a de Bruijn graph for each k-mer length and then combines the information to produce a final assembly. Using what you learned above about exploring assemblies, lets look at how an assembly is affected by our choice of k-mer length. You may need to alter the mummer minCluster value to produce a valid .delta file.

First, lets look at an assembly produced with a single short k-mer of length of 11:

k=11

time spades.py \
  --threads 2 \
  -k ${k} \
  -o ~/Project_6/de_novo_illumina/SRR11140748_10x_PE-k${k} \
  -1 ~/Project_6/data/illumina_pe/SRR11140748_1_10x.fastq.gz \
  -2 ~/Project_6/data/illumina_pe/SRR11140748_2_10x.fastq.gz \
| tee ~/Project_6/de_novo_illumina/SRR11140748_10x_PE-k${k}.log

Next, lets look at an assembly produced with a long k-mer of length of 127:

k=127

time spades.py \
  --threads 2 \
  -k ${k} \
  -o ~/Project_6/de_novo_illumina/SRR11140748_10x_PE-k${k} \
  -1 ~/Project_6/data/illumina_pe/SRR11140748_1_10x.fastq.gz \
  -2 ~/Project_6/data/illumina_pe/SRR11140748_2_10x.fastq.gz \
| tee ~/Project_6/de_novo_illumina/SRR11140748_10x_PE-k${k}.log

Now modify the above commands to produce another 2 assemblies but using the 100x coverage data.

If you feel adventurous you can automate explorations of the effect of k by writing a shell script using a for loop similar to the example below.

#!/bin/bash
for k in 11 21 51 81 127
do
echo ${k}
done
exit 0

Questions