In today’s practical we are trying to wean you off of detailed instructions and will expect you to be able to do basic things that have been covered before, such as renaming a file or using wget
. You should read through the practical fully before beginning so that you understand the sequence of operations and what is expected.
Just as we’ve created a new R Project
for practicals 1 to 3, let’s create a new one for today to make sure we’re all in the same place.
File
menu at the top left, select New Project
New Directory
New Project
~
, navigate to the Home directory using the Browse buttonDirectory Name
space, enter Practical_5
, then hit the Create Project button.This again helps us keep our code organised and is good practice.
Get the sequences from the Bioinf 3000 repository (https://university-of-adelaide-bx-masters.github.io/BIOTECH-7005-BIOINF-3000/Practicals/evolutionary_prac/bovidae_118_mtDNA.fa) using wget
.
Run the following command using your student number for id:
./subset -id aXXXXXXX -n 50 -in bovidae_118_mtDNA.fa > bovidae_50_mtDNA.fa
(The subset command can be obtained from here (https://university-of-adelaide-bx-masters.github.io/BIOTECH-7005-BIOINF-3000/Practicals/evolutionary_prac/subset)
using wget
.)
Remember to make subset
executable (see the bash
practical for a reminder on how to do this)
This selects 50 sequences from the input alignment according to your student number and writes them to a new file. This is not normal practice, and is only necessary because of prac time limits.
You will use the following software today: MAFFT
for making multiple alignments read about it here, Mr Bayes
for building Bayesian phylogenies read about it here, seqmagick
a comprehensive toolkit for sequence conversion/manipulation read about it here and Gblocks
a tool to remove poorly aligned regions from multiple alignments to improve the inference of phylogenies read about it here.
You will need to activate the bioinf
conda environment so that you can use Gblocks
.
source activate bioinf
In order to construct a phylogenetic tree, we need to provide positional information to the tree reconstruction program so that related positions are comparable. This is done by constructing a multiple sequence alignment. There are many programs that can be used to do this; two fast programs are MUSCLE and MAFFT, today we have installed and will be using MAFFT.
Run the following commands:
mafft bovidae_50_mtDNA.fa > bovidae_50_mtDNA-mafft.fa
Many programs unfortunately have name length limitations. One of the programs we will be using (Gblocks) cannot have long sequence identifiers and is not open source, so we cannot fix it. To get around this problem we will shorten the identifiers in a meaningful way.
Run the following command to change the sequence IDs:
sed -e 's/^>[^ ]\+ \([^ ]\+\) \([^ ]\+\).*$/>\1_\2/g' bovidae_50_mtDNA-mafft.fa > bovidae_50_mtDNA-named.fa
Q1. What does this command do?
In order to be able to examine the alignments more effectively we will convert the format from FASTA to NEXUS format. This will make it easier to see the alignment.
seqmagick convert --output-format nexus --alphabet dna bovidae_50_mtDNA-named.fa bovidae_50_mtDNA-named.nex
Look at the NEXUS alignment file. At the beginning of the alignment and near the end there are regions that have large gaps and very poor conservation.
Q2. What is the reason for this? (Hint: use the accession numbers in the name to search Entrez for the annotation.)
The phylogenetic reconstruction methods we will be using cannot handle missing bases, so these must be removed.
We can use Gblocks to remove the non-conserved regions of the alignment.
Run Gblocks
by entering Gblocks
.
You should see something like this:
******************************************************
GBLOCKS 0.91b
SELECTION OF CONSERVED BLOCKS FROM MULTIPLE ALIGNMENTS
FOR THEIR USE IN PHYLOGENETIC ANALYSIS
******************************************************
o. Open File
s. Saving Options
g. (Get Blocks)
b. Block Parameters
s. Saving Options
g. (Get Blocks)
q. Quit
Use the menu options in the program to remove the non-conserved regions from the fa
file (not the nex
file).
This will give you a file bovidae_50_mtDNA-named.fa-gb
.
Q3. Why do we need to do this? (Hint:read the Gblocks documentation)
The program we will be using to perform phylogenetic reconstruction uses a sequence (and other character) format called NEXUS. The NEXUS format is fairly widely used for phylogenetic data as it can be used to encode a variety characters, not limited to sequence data.
Use seqmagick
to convert the fa-gb
file containing the trimmed alignment blocks to a nexus format with the name bovidae_50_mtDNA-named.nex
.
Note that you will need to rename the fa-gb
file so that it has an fa
file extension, remember not to overwrite your original bovidae_50_mtDNA-named.fa
when doing so.
Look at your .nex
file to satisfy yourself that things are as they should be.
Mr Bayes is a program for performing phylogenetic tree reconstruction from multiple sequence alignments using a Bayesean inference. The Bayesean approach to reconstruction is based on using a likelihood function to estimate posterior probabilities for trees and model parameters. There are concerns in the literature about the validity of Bayesean reconstruction, but Mr Bayes provides a good environment to do the practical and Baysean inference works for many data sets.
A manual for Mr Bayes is available from here.
Start Mr Bayes by entering the following command into a terminal:
mb
This will give you a Mr Bayes prompt that should look like this.
MrBayes v3.2.6 x64
(Bayesian Analysis of Phylogeny)
Distributed under the GNU General Public License
Type "help" or "help <command>" for information
on the commands that are available.
Type "about" for authorship and general
information about the program.
MrBayes >
If you enter help
at the prompt you will get a list of topics that can be read, for example help execute
.
First you will need to load the data set that you prepared.
Use the execute
command to load the NEXUS sequence file you created with seqmagick.
Q4. How many taxa were read into memory? Q5. How many characters are being used?
For the practical you are to perform 2 reconstructions, choosing from substitution models with 1, 2, 6 or mixed. Each run takes about an hour, so you will have time to work on other parts of the practical while it is running.
First you need to set the number of substitution rate parameters used in the model. Before you do this, see what the values of parameters are by default in the likelihood setting.
help lset
You can see that the default rate matrix is 4 by 4 which corresponds to nucleotide.
Q6. What substitution models does the default number of substitution types correspond to?
Set the number of substitution types to the number you have chosen (written below as <num>
).
lset nst=<num>
The next thing to do is start the tree reconstruction analysis.
This is done using the mcmc
(Markov chain Monte Carlo) command.
Before you do this, you will need to set some options.
First examine the defaults.
help mcmc
You can see here that the default analysis runs for 1,000,000 iterations, retaining every 500th iteration for tree reconstruction. You can also see that the first 25% of samples are discarded by default.
We don’t have enough time to run for 1x10^6 generations, we want to see the convergence behaviour and we want to specify the output file name (again <num>
is the model subtitution type number), so start the analysis with the following command.
mcmc ngen=50000 relburnin=no burnin=0 filename=model-<num>
This will take 5-6 minutes (the remaining time is shown to the right of the output). You will see in the output the log likelihoods of the trees being considered (2 runs by 4 chains) in the analysis, and every 5 lines the variance between the two runs is output to give an indication of how far the trees have converged. You can see that as the run continues the average standard deviation of split frequencies decreases, showing that the two runs are converging on the same tree or similar trees.
After the MCMC has completed the 50,000 iterations, it will output the final average standard deviation of split frequencies and ask if the run is to be continued. Answer “no”.
See the Mr Bayes manual for a more complete explanation of how to decide the answer to this question.
When you stop the runs, Mr Bayes will output summary data from the analysis.
To look at the convergence, execute the following command.
sump
Q7. Describe what the output graph shows.
Find where the plateau starts by choosing different burnin lengths (note that the burnin describes the number of leading sample to discard, not the number of MCMC iterations).
Q8. Approximately when does the plateau start in terms of number of samples?
When you are happy with the burnin - that the log likelihood has plateaued - you can get the consensus tree to be calculated.
sumt
This will output a large quantity of data about the estimated paramaters and the statistical support for branch nodes, and two trees, one showing the consensus branch lengths and one showing the branch node support. It will also output a consensus tree file called “model-<num>.con.tre”.
Repeat the analysis with another model and compare the trees using the FigTree program. FigTree
is a tree visualising tool that you will need to run on your computer. Select the best download type for your computer/operating system and install it (use .dmg for Mac, use .zip for Win).
Q9. Do the trees from the two models differ? Q10. How does the branch support differ between the two trees? Q11. Do the trees agree with the known taxonomic groupings?
Answer the questions in bold above (they are worth 1 mark each).
You will have produced two trees for Bovidae (two different models), submit them as part of your practical as Figure 1 (5 marks/tree).
Produce a Bayesean tree for the whole mitochondrial genome of Marsupials.
Submit the Marsupial tree as Figure 2 (5 marks).
You can get the Marsupial mitochondrial genomes by installing the fetch
program on your VM from this location https://university-of-adelaide-bx-masters.github.io/BIOTECH-7005-BIOINF-3000/Practicals/evolutionary_prac/fetch using wget
. You will need to chmod -x
it before you can use it. You can download it to your Practical_5
subdirectory and run it from there. Fetch
is a program that allows you to search the NCBI databases and batch download the results of your search; this is much more convenient than using a browser to retrieve things one by one. In order to get the sequences from NCBI you will need to run the command:
./fetch -email <youremailaddress> -query "mitochondrion[All Fields] AND \"Metatheria\"[Organism] AND \"complete genome\"[All fields] AND \"RefSeq\"[All fields]" -out metatheria_mtDNA.fa
The server requires your email to fetch the sequences and the -out <outfile>
is required unless you want to stream the sequences to stdout
.
Alternatively, these genomes can be obtained the way a biologist would from here.
Summary
link, select FASTA (text
) this will open a tab with all of the sequences as textsource
pane.Practical_5
subdirectory as metatheria_mtDNA.fa
Q12. What two species are the best to use as an outgroup?(1 mark)
Q13. Which two leaves are the closest related/have the shortest branch?(1 mark)
Q14. What is the evolutionary distance between these two leaves?(1 mark)
Q15. Which of the following two species are more closely related?(1 mark)
1. silky-shrew oppossum
2. Tasmanian wolf
3. koala
4. platypus