BIOINF-3010-7150

SARS-CoV-2 Resequencing

By Nathan Watson-Haigh, updated/adapted by Dave Adelson

As with previous week’s practicals, you will be using RStudio to interact with your VM.

See week 1’s practical to remind yourself how to connect to your VM.

Setup for today

Today you are expected to be able to follow the instructions on your own. There will be a brief introduction by Dave, but after that you will need to work through the instructions. Try to work out for yourselves the answers to the questions before asking for help. If you are really stuck and can’t figure out how to get unstuck, ask for help.

Working Directory

First we will set up a directory for today’s practical. In general it is very worthwhile to keep all your project-specific code and data organised into a consistent location and structure. This is not essential, but is very useful and is in general good practice. If you don’t follow this step, you will be making your life immeasurably harder for the duration of this practical.

To make and enter the directory that you will be working in, run the following commands in the terminal pane.

# Setup project working directory
mkdir --parents ~/Project_3/data/
cd ~/Project_3/

Software packages

You will use samtools, bwa and minimap2 today. minimap2 uses the conda environment which you need to activate with the following command.

source activate bioinf

Data

For the next couple of weeks we’re going to be using SARS-CoV-2 data, the virus causing COVID-19. We will be looking at this data in different ways, providing insights into how bioinformaticians analyses these types of data.

RefSeq Genome Assembly

The SARS-CoV-2 genome assembly we will use is one of the earliest ones generated NC_045512.2.

This assembly was generated from a sample collected in Dec 2019 and submitted to NCBI on 13th Jan 2020 as NC_045512.1. Subsequently, NC_045512.1 was replaced by NC_045512.2 on the 17th Jan 2020.

Question

Public Sequencing Data

Public sequence data is being released via NCBI’s SARS-CoV-2 page.

We will be looking at some data released on 21st Feb 2020 for a clinical swab obtained from a confirmed case in Madison, WI, USA. For further information see: https://openresearch.labkey.com/Coven/project-begin.view?

There is both Illumina and Oxford Nanopore data available for several samples. We will be working with the SRR11140748 sample but you are welcome to also look at any of the other 3 samples if you have time. Here is a table of information linking to the orginal source of the data:

Description Accession/URL Coverage
RefSeq NC_045512.2  
Illumina    
veroSTAT-1KO SRR11140744 7,586
veroE6 SRR11140746 5,328
vero76 SRR11140748 6,353
swab SRR11140750 252
Nanopore    
veroSTAT-1KO SRR11140745 8,727
veroE6 SRR11140747 12,006
vero76 SRR11140749 10,170
swab SRR11140751 792

We have also provided access to a subsample (~100x coverage) of the above data sets to facilitate a more timely analysis and improve data visualisation responsiveness.

Get the Data

# Make the directory for the Illumina PE reads
mkdir --parents ~/Project_3/data/illumina_pe/

# Get the Illumina data for the SRR11140748 sample
cp ~/data/SARS-CoV-2_Resequencing/SRR11140748_?_100x.fastq.gz ~/Project_3/data/illumina_pe/

Question

Genome Resequencing

Genome resequencing is performed when a good quality genome assembly already exists for a species. We generate sequence data, most commonly Illumina reads, for one or more samples and align them to the reference genome. We can then interrogate the read alignments and ask questions about how our sample differs from the reference genome and what the significance of those differences might be.

Initial Goals

  1. Gain familiarity with aligning reads to a reference genome
  2. Gain familiarity with exploring read alignments using IGV, a genome browser
  3. Gain familiarity with Oxford Nanopore long read data

FastQC

Using what you learned last week, run FastQC across the raw data for the SRR11140748 sample and inspect the FastQC reports.

Questions

Quality and Adapter Trimming

Using what you learned last week and what you found in the FastQC reports generated above, processes the reads through trimmomatic or fastp if you deem it necessary.

Once you are happy the reads are of good quality, lets see how much data we have:

pigz --decompress --to-stdout --processes 2 \
  ~/Project_3/data/illumina_pe/SRR11140748_?_100x.fastq.gz \
| sed -n '2~4p' \
| awk 'BEGIN{OFS="\t"}{z+=length($0); y+=1}END{print z,y,z/y}'

Questions

Read Alignment

There are many tools for performing Illumina PE read alignment (aka mapping) against a genome. Today we will use a tool called BWA. Conceptually, a read aligner aims to find the most likely place in the genome from which the read was generated. Think of this as looking up a persons name in a phonebook.

Questions

Would it be easier to search for a name if:

Tab index

Indexing the Reference Genome

Having the information stored in a certain way and having an index makes looking up names in the phonebook much faster. In the context of aligning reads to a reference genome, having an “index” of the reference genome speeds up the process of finiding and aligning reads to the genome. The index of a genome is often tool-specific so we often need to create a different index for each tool we plan to use.

Like sorting and indexing a phonebook, this process is time consuming. However, since it’s only done once, and it provides significant speed-ups in read alignment, it is time well spent!

Index the reference genome:

# Make the directory for the reference genome
mkdir --parents ~/Project_3/data/reference/

# Get the SARS-CoV-2 reference genome
cp ~/data/SARS-CoV-2_Resequencing/COVID-19.fasta.gz ~/Project_3/data/reference/

# Index the reference genome for use with BWA
bwa index ~/Project_3/data/reference/COVID-19.fasta.gz

Questions

Align the Illumina Reads

Align the Illumina PE reads to the reference genome:

# Create an output directory ahead of time
mkdir --parents ~/Project_3/resequencing/

# Align reads
time bwa mem \
  ~/Project_3/data/reference/COVID-19.fasta.gz \
  ~/Project_3/data/illumina_pe/SRR11140748_1_100x.fastq.gz \
  ~/Project_3/data/illumina_pe/SRR11140748_2_100x.fastq.gz \
> resequencing/SRR11140748_illumina.sam

# Coordinate sort SAM file, and output to BAM
time samtools sort \
  -o resequencing/SRR11140748_illumina.bam \
  resequencing/SRR11140748_illumina.sam

Questions

BAM File Visualisation

We are going to load the SARS-CoV-2 genome into a genome browser called IGV and then load the BAM file. First, we need to create an index for the reference genome and an index file of the BAM file. This ensures IGV can quickly load the reference sequence and corresponding read alignments for any coordinates we choose to navigate to.

# IGV-web requires an uncompressed reference sequence 
# and corresponding index file
gunzip \
  < ~/Project_3/data/reference/COVID-19.fasta.gz \
  > ~/Project_3/data/reference/COVID-19.fasta

# Generate index of the FASTA-formatted genome file
samtools faidx ~/Project_3/data/reference/COVID-19.fasta

# Index the BAM file
samtools index resequencing/SRR11140748_illumina.bam

Download the following files to your local computer using RStudio’s File browser. Simply select 1 file at a time by checking the checkbox and click “More” » “Export…”. Click the “Download” button and save it somewhere obvious.

Visit, IGV-web and load the genome from a Local File ... by selecting both the COVID-19.fasta and COVID-19.fasta.fai files. Once the reference genome is loaded, load a “Track” from a Local File ... by selecting both the SRR11140748_illumina.bam and SRR11140748_illumina.bam.bai files.

IGV-web SRR11140748 Illumina

Questions

Align the Oxford Nanopore Reads

Long reads have the potential for spanning “repeat” regions, untangling the mess that short reads make in these areas, provide evidence of structural variation etc. Lets align the long reads to the reference genome.

# Activate the bioinf conda environment for minimap2
source activate bioinf

# Index the reference genome for minimap2
minimap2 \
  -d ~/Project_3/data/reference/COVID-19.fasta.gz.mmi \
  ~/Project_3/data/reference/COVID-19.fasta.gz

# Get the Nanopore reads
mkdir -p ~/Project_3/data/nanopore/
cp ~/data/SARS-CoV-2_Resequencing/SRR11140749_1_100x.fastq.gz ~/Project_3/data/nanopore/

# Align the reads
time minimap2 \
  -ax map-ont \
  -t 2 \
  ~/Project_3/data/reference/COVID-19.fasta.gz.mmi \
  ~/Project_3/data/nanopore/SRR11140749_1_100x.fastq.gz \
| samtools sort --threads 2 -o resequencing/SRR11140749_nanopore.bam

# Index the BAM file
samtools index resequencing/SRR11140749_nanopore.bam

Load the nanopore read alignment file into your IGV-web.

IGV-web SRR11140749 Nanopore

Questions

External Resources