In today’s practical, you will be investigating metagenomic Illumina sequencing data from a hospital wastewater sample.
For today’s practical, you will need to activate the bioinf conda environment:
source activate bioinf
Let’s create a new directory for today’s practical and create subdirectories that reflect the main steps in our analysis. This will help us stay organised.
mkdir -p ~/Practical_metagenomics/{0_raw,1_species,2_assembly,3_blast,db}
mkdir -p ~/Practical_metagenomics/0_raw/FastQC
mkdir -p ~/Practical_metagenomics/2_assembly/quast
mkdir -p ~/Practical_metagenomics/3_blast/{dbs,queries,results}
The data for today’s practical is located in ~/data/metagenomics. As in previous practicals, we will use symlinks instead of copying large data files.
# navigate to working directory
cd ~/Practical_metagenomics
# create symlinks for Illumina read files
ln -s ~/data/metagenomics/ERR11269302_[12].fastq.gz 0_raw/
# create symlink for kraken database
ln -s ~/data/dbs/kraken/std_8g db/
# create symlink for AMR gene query sequences
ln -s ~/data/metagenomics/AMR_genes.fasta 3_blast/queries/
Run FastQC on the raw Illumina reads. Be aware that the fastq files are very large and this step will take ~15 minutes.
Look at the FastQC reports and answer the following questions:
While trimming would be best-practice, we’re not going to do that today due to time and resource constraints.
There are a number of bioinformatic tools available for investigating species composition of metagenomic sequencing data. Many of these tools have high memory requirements. For this practical we’re going to use kraken2 with a scaled down database that can run on our VMs, as in previous practicals.
Run kraken2 and bracken on the Illumina reads. Be aware that this will take some time to run (~15 minutes).
Have a look at the inferred species composition in the bracken output file. Remember that you can use grep if you want to search for a particular species of interest.
Questions:
Metagenomic assembly can be very computationally intensive. As your VM only has 16 GB RAM, we’ve run the assembly for you. For your information, the assembly was generated using SPAdes with the following command, which took ~8 hours to run (DON’T TRY TO RUN THIS ON YOUR VM):
spades --meta -t 72 -m 173 -1 ERR11269302_trim_1.fastq.gz -2 ERR11269302_trim_2.fastq.gz -o ERR11269302_metaspades
Questions:
Run the following command to create a symlink to the metagenomic assembly in your directory for today’s practical:
ln -s ~/data/metagenomics/ERR11269302_metaspades_contigs.fasta 2_assembly/
Let’s start by counting the number of contigs in our metagenomic assembly:
grep -c "^>" 2_assembly/ERR11269302_metaspades_contigs.fasta
Question:
We can also see information about contig length and coverage by extracting the contig headers:
grep "^>" 2_assembly/ERR11269302_metaspades_contigs.fasta | less
Questions:
Now let’s run QUAST to look at a range of assembly metrics. We will include our E. coli isolate assembly in the QUAST report so that we can compare the isolate assembly and the metagenomic assembly directly:
quast -t 2 -o 2_assembly/quast/ERR11269302 2_assembly/ERR11269302_metaspades_contigs.fasta ~/Practical_assembly_short/2_assembly/SRR36298124/contigs.fasta
Questions:
One of the things an assembly allows us to do is to search for specific sequences of interest. Let’s have a look at whether we can find antimicrobial resistance (AMR) genes in our metagenomic dataset. We’ll begin by creating a BLAST database of our metagenomic assembly:
makeblastdb -in 2_assembly/ERR11269302_metaspades_contigs.fasta -dbtype nucl -out 3_blast/dbs/ERR11269302_metaspades
Now we can query the database using our set of AMR gene query sequences:
blastn -db 3_blast/dbs/ERR11269302_metaspades -query 3_blast/queries/AMR_genes.fasta -outfmt "7 qlen slen std" > 3_blast/results/ERR11269302_AMR.blast
Questions: