As with previous week’s practicals, you will be using RStudio to interact with your VM.
See week 1’s practical to remind yourself how to connect to your VM.
First we will set up a directory for today’s practical. In general it is very worthwhile to keep all your project-specific code and data organised into a consistent location and structure. This is not essential, but is very useful and is in general good practice. If you don’t follow this step, you will be making your life immeasurably harder for the duration of this practical.
To make and enter the directory that you will be working in, run the following commands in the terminal pane.
# Setup project working directory
mkdir --parents ~/Project_7/data/
cd ~/Project_7/
# Make subdirectories for the various data sets
mkdir --parents data/{reference,scripts,illumina_pe,pacbio}/
# Get the data
#####
# RefSeq E. coli K-12 substr. MG1655
cp ~/data/Short_and_long_read_alignment/NC_000913.3.fasta.gz data/reference/
# Illumina PE
cp ~/data/Short_and_long_read_alignment/36_ACGCACCT-GGTGAAGG_L002_R?_001_*x.fastq.gz data/illumina_pe/
# PacBio
cp ~/data/Short_and_long_read_alignment/lima.bc1106--bc1106_*x.subreadset.fastq.gz data/pacbio/
# R script
cp ~/data/SARS-CoV-2_Resequencing/plot_delta.R data/scripts/
Using what you have learned previously, regarding read quality control, trimming and assembly tools, process the available Illumina PE and PacBio reads to assemble the E. coli K-12 substr. MG1655 genome.
Use the tool SPAdes (spades.py
) to perform a de novo assembly from BOTH Illumina and PacBio reads.
You will need to view the SPAdes documentation/manual to figure out how to specify both Illumina PE and PacBio reads in the same assembly.
As before, there are several subsamplings (2
, 4
, 5
, 8
, 10
, 20
and 40
) of the original reads for both the Illumina and PacBio data.
When performing the assembly, consider how much of each data type you might need/use to generate a good assembly in reasonable time.
HINT: time how long each spades.py
command takes to complete.
You should use what you learned about comparing an assembly against a reference genome to ascertain if an assembly is reasonable or whether more/less data might be required.
Most of these hybrid assemblies will take more than 10 mins each, so you won’t be able to do many within the time limit of this practical.
Given the limited amount of time in this practical:
Visualise one of your good assemblies in IGV or IGV-web and look to see if you can find any inconsistencies between the aligned Illumina and PacBio data. To do this you will first need to align your assembly to the reference using bwa mem. Then, load your assembly alignment, as well as the read alignments from the last practical, into IGV.
Have a look at the following regions:
NC_000913.3:4,533,722-4,554,201
NC_000913.3:378,123-379,127
Also, take another look at the regions we explored in the last practical:
NC_000913.3:276291-294244
NC_000913.3:4295777-4296810