BIOINF-3010-7150

Ancient DNA and population genomics practical: Part 2 - Yassine Souilmi

Icons are used to highlight sections of the practicals:

Book Information
Computer Hands-on tasks
Questions Questions


Day 2: Population genomics applications

Practical outcomes

Book At the end of today’s practical, you will know how to explore contemporary and ancient genomic diversity to infer population history. The practical is loosely based on Monday’s lecture about the population history of Indigenous peoples of the Americas, in particular the Posth et al. Cell paper that was published in 2018.

Reconstructing the Deep Population History of Central and South America (Posth et al. 2018 Cell)

Book This study generated 49 new genome-wide datasets that consist of enriched Illumina high-throughput sequencing of 1.2M SNPs from ancient DNA samples. All sampled individuals are from Central (Belize) and South (Brazil, Peru, Chile, Argentina) American individuals. The skeletal remains are dated between 10,900–700 BP (years before present), with the large majority of remains older than 1,000 BP.

Book We know from previous genetic studies that Indigenous peoples of the Americas were isolated from the rest of the world since the peopling of the Americas until European colonisation during the 16th century. Thus we can safely assume that our ancient individual genomic datasets should not harbour signs of recent genetic admixture with non-Indigenous Americans.

Book Finally, previous work on two particular ancient individuals from North America informed us about the expected ancestry in Central and South America:

Book The research question we asked was relatively simple: what was the processus for the earliest population movements in South America? In particular, we were interested in determining if all contemporary populations descend from one migration event, or successive waves separated in time.

Let’s explore the datasets

Computer Activate the environment variation.

source activate popgen

Book I provided 2 datasets in EIGENSTRAT format. As a reminder, the EIGENSTRAT format consists of 3 files:

Computer Unarchive the practical data (stored in ~/data/genomics/ancient/) in your working directory.

mkdir -p ~/Project_12_2/{data,results,scripts}
cd ~/Project_12_2/
tar xvzf ~/data/ancient/tutorial_popgen.tar.gz -C data/
ll data/

Book The files with the AllAmerica_Ancient.eigenstrat prefix contain the 49 ancient Central and South Americans and data from modern South American individuals, as well as Anzick-1 and USR1. The files with the AllAmerica_Ancient_YRI.eigenstrat prefix contain the same data with the addition of one African individual (YRI: Yoruba).

Computer Using bash commands, answer the following questions.


Questions Questions

1) How many individuals are in the AllAmerica_Ancient.eigenstrat.ind dataset? 2) Is there missing data in the ancient dataset AllAmerica_Ancient.eigenstrat.geno? 3) How many SNPs in each dataset? Hint: look at the .snp files.


PCA

Computer We are going to use SMARTPCA (as part of the EIGENSOFT utilities, see the end of Tuesday’s practical) and an implementation of ADMIXTOOLS in an R package called admixR (it needs ADMIXTOOLS to be already installed). These programs are installed in the conda environment popgen.

# Make sure to copy the `$PATH` variable to `.Renviron`
echo "PATH=$PATH" >> ~/.Renviron

Computer Go to Session > Restart R.

Computer Build a parameter file named data/par.AllAmerica_Ancient.smartpca that will be one of the inputs for SMARTPCA. Because ancient data contain a lot of missing data, we are going to force SMARTPCA to construct the eigenvectors based on the contemporary populations (listed in poplistname) and then project the ancient samples onto the PCA (lsqproject).

Let us edit the file in nano

nano data/par.AllAmerica_Ancient.smartpca

Copy and Paste the following block of text into the file.

genotypename:    data/AllAmerica_Ancient.eigenstrat.geno
snpname:         data/AllAmerica_Ancient.eigenstrat.snp
indivname:       data/AllAmerica_Ancient.eigenstrat.ind
evecoutname:     results/AllAmerica_Ancient.smartpca_results.evec
evaloutname:     results/AllAmerica_Ancient.smartpca_results.eval
numoutevec:      5
lsqproject:      YES
poplistname:     data/poplistPCA

To save changed and exit editing more. Press Ctrl + x to open the exit prompt. Then press y, followed by Enter to save and exit. Computer Run SMARTPCA

smartpca -p data/par.AllAmerica_Ancient.smartpca

Computer SMARTPCA has generated two output files with the suffixes .evec (first row is the eigenvalues for the first 5 PCs, and all further rows contain the PC coordinates for each sample) and .evac (all the eigenvalues). Go to the R console and create plots.

# load libraries
library(tidyverse)
library(cowplot)

# Set your working directory
setwd("~/Project_12_2/results/")

# data for scree plot
adat.scree <- read.table("AllAmerica_Ancient.smartpca_results.eval", header = FALSE)
# Add a column with row number (only needed to be able to do a bar plot)
adat.scree$Name = 1:nrow(adat.scree)
# Rename columns
colnames(adat.scree) <- c("Scree","Name")
# Restrict to the first 10 lines (first 10 eigenvalues)
adat.scree <- head(adat.scree, 10)
# Plot scree plot data
adat.screep <- ggplot(adat.scree,aes(x = Name, y = Scree)) +
               geom_bar(stat = "identity") + # heights of the bars represent values in the data
               theme(axis.text.x = element_blank(), # no text for the x-axis
                     axis.ticks.x = element_blank()) + # no ticks for the x-axis
               xlab("component") + # x-axis label
               ylab("eigenvalue") + # y-axis label
               theme_bw() # Set theme
          
# Create dataframe for the data
adat <- read.table("AllAmerica_Ancient.smartpca_results.evec", header = FALSE)
# Reduce table to just the sample name, the first 3 PCs, and the population name
adat <- adat[,c(1:4,7)]
# Rename columns
colnames(adat) <- c("SAMPLE", "PC1", "PC2", "PC3", "POP")
# Create column for ancient vs contemporary, conditioning on the string "BP" (included in the dates) or the name of the ancient sample (Anzick and USR1)
adat$DATE <- ifelse(grepl('BP', adat$POP), "Ancient",
                    ifelse(grepl('Anzick', adat$POP), "Anzick",
                           ifelse(grepl('USR', adat$SAMPLE), "USR", "Contemporary")))
# Create column with simplified names
#adat$GROUP <- word(adat$POP, 1, sep = "_|\\.")
# Create plot with populations in different colours and super-populations with different point shapes
adat.pc12 <- ggplot(adat, aes(x = PC1, y = PC2)) + 
             geom_point(aes(fill = POP, colour = POP, shape = DATE), size = 4) +
             scale_shape_manual(values=c(21, 22, 24, 25)) + theme_bw()

# Combine plots using plot_grid from cowplot
prow <- plot_grid(adat.pc12 + theme(legend.position="none"), # remove the legend
                  adat.screep + theme(legend.position="none"), # remove the legend
                  align = 'vh', # plots are aligned vertically and horizontally
                  nrow = 1, # 2 plots on one row
                  rel_widths = c(1, .5)) # width ratio between plots is 1:0.5

# Prepare legend for the PCA plot
legend <- get_legend(adat.pc12 + 
                     guides(color = guide_legend(nrow = 5)) + # legend spans 5 lines
                     theme(legend.position = "bottom")) # legend is displayed at the bottom of the plots (i.e., horizontal not vertical)
                     
# Combine plots and legend using plot_grid from cowplot
plt1 <- plot_grid(prow, legend, ncol = 1, rel_heights = c(1, .3)) # height ratio between plots and legend is 1:0.3

# Save plot as pdf 
ggsave("AllAmerica_Ancient.smartpca_scree.pdf", plt1, dpi=300, height=10, width=14)

Questions Questions

4) The scree plot represents the value for each eigenvector, i.e., the variance in the data explained by the eigenvector. In your opinion, does the first eigenvector explain much variance compared to other vectors?
5) PC1 seems to capture the variation observed between eskimos and modern Peruvian (PEL), while PC2 seems to capture the variation just within PEL. Knowing that PEL is individuals from Lima, the capital city of Peru, why would the PEL population be so diverse?
6) Where do the ancient samples cluster in regards to the PCA coordinates? And where in regards to contemporary populations?


F3 statistics

Book We want to infer the relative divergence times between pairs of populations, and in which order they split from each other. We can use an outgroup F3 statistic by fixing the outgroup as YRI, and calculating pairwise F3 statistics between populations. The higher the F3 value, the more shared drift between the two test populations, i.e. the more related they are.

Book Using ADMIXTOOLS to compute F and D statistics can be very time consuming because the programs are not user friendly, and building the parameter files can be time consuming. Instead, we can use the R implementation admixr of ADMIXTOOLS by Martin Petr. You may want to read the very short Bioinformatics Applications Note, or better, explore the comprehensive tutorial on your own time.

Computer Load the libraries needed to run admixr (use the R console) and run F3 statistics on a subset of populations.

#OPTIONAL: Install packages if they are not readily available
#library(devtools)
#devtools::install_github("bodkan/admixr")
#install.packages("tidyverse")
library(admixr)
library(tidyverse)

# Set working dir
setwd("~/Project_12_2/")

# Load the dataset that includes the African individual
snpsAmerica <- eigenstrat(prefix = "data/AllAmerica_Ancient_YRI.eigenstrat")

# Create a list of population we want to test (just a subset of the 
pops <- c("Eskimo", "Aymara", "Anzick", "USR1", "Peru_Lauricocha_8600BP", "Peru_Lauricocha_5800BP", "Brazil_LapaDoSanto_9600BP", "Chile_LosRieles_10900BP")
result <- f3(A = pops, B = pops, C = "YRI", data = snpsAmerica)

head(result)

Book The output table contains a lot of information that we can unpack:

Computer We can also plot a heatmap to better visualise the pairwise comparisons:

# Sort the population labels according to an increasing F3 value relative to Aymara
ordered <- filter(result, A == "Aymara", B != "Aymara") %>% arrange(f3) %>% .[["B"]] %>% c("Aymara")

# Plot heatmap of pairwise F3 values
result %>%
  filter(A != B) %>%
  mutate(A = factor(A, levels = ordered),
         B = factor(B, levels = ordered)) %>%
  ggplot(aes(A, B)) + geom_tile(aes(fill = f3)) + theme_bw()

Questions Questions

7) What two populations/individuals seem to diverge earlier than the others?


D statistics

Book Now we want to know how populations compare in terms of ancestry from Anzick-1. For this, we can consider of comparing either Anzick-1 or USR1 with the different populations, using YRI as an outgroup. Since we know already that USR1 did not contribute any ancestry to South Americans, we are basically testing the proportion of Anzick-1 ancestry only.

Computer Run D statistics on a subset of populations.

# Create a list of populations we want to test (just a subset of the populations in the dataset)
pops2 <- c("Eskimo", "Aymara", "Peru_Lauricocha_5800BP", "Brazil_LapaDoSanto_9600BP", "Chile_LosRieles_10900BP")
result2 <- d(W = pops2, X = "USR1", Y = "Anzick", Z = "YRI", data = snpsAmerica)

head(result2)

Book Again, the output table contains a lot of information that we can unpack:

Computer However, a graphic representation is always easier to interpret:

# Sort the population labels according to an increasing D value and plot average and Z-score
ggplot(result2, aes(fct_reorder(W, D), D, color = abs(Zscore) > 2)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_errorbar(aes(ymin = D - 2 * stderr, ymax = D + 2 * stderr)) + 
  theme_bw()

Questions Questions

8) Is there any test population/individual for which D is not different from 0? What does it mean in terms of admixture?
9) Is there any test population/individual for which D is different from 0? Any particular pattern to report?


Book To estimate the minimum number of streams of ancestry contributing to Central and South American populations, we have used in our study the software qpWave (also implemented in admixr). qpWave assesses whether F4-statistics of the form F4(A = South American 1, B = South American 2; X = outgroup 1, Y = outgroup 2) form a matrix that is consistent with different ranks: rank 0 is consistent with a single stream of ancestry relative to the outgroups, rank 1 means 2 streams of ancestry, etc. This is how we could identify at least 3 streams of ancestry: one related to Anzick-1, 2 others related to other North American populations and never reported before.