Transcriptomics: Lecture 1
Frontiers of Biotechnology: Bioinformatics and Systems Modelling
The University of Adelaide
Welcome To Country
I’d like to acknowledge the Kaurna people as the traditional owners and custodians of the land we know today as the Adelaide Plains, where I live & work.
I also acknowledge the deep feelings of attachment and relationship of the Kaurna people to their place.
I pay my respects to the cultural authority of Aboriginal and Torres Strait Islander peoples from other areas of Australia, and pay my respects to Elders past, present and emerging, and acknowledge any Aboriginal Australians who may be with us today.
Introduction To Transcriptomics
Introduction
- Postdoctoral Bioinformatician, Black Ochre Data Labs, Adelaide
- Working in collaboration with members of the SA Aboriginal community
- Multi-omics project to identify and address the underlying causes of high type 2 diabetes (T2D) rates and complications
- Using genomics, epigenomics, transcriptomics and other layers
- My focus is on the transcriptomics layer
Why Transcriptomics?
- DNA can be described as being like a giant book of instructions
- Some regions are defined as genes
- Originally considered to be the basic unit of inheritance
- Now commonly used to describe a region of DNA transcribed into RNA
- Recent discovery of enhancer-RNA (eRNA) muddies the water a little
By Thomas Shafee - Own work, CC BY 4.0, Wikimedia Link
What Is Transcription
Definition
Transcription is the process of making an RNA copy of a gene sequence
Steps of Transcription
- RNA polymerase binds to the promoter along with \(\geq1\) transcription factors
- RNA polymerase creates a transcription bubble
- separates the two DNA strands, breaking hydrogen bonds between complementary DNA nucleotides.
- RNA polymerase adds RNA nucleotides
- complementary to the antisense DNA strand.
- RNA sugar-phosphate backbone forms
- Hydrogen bonds of the RNA–DNA complex break freeing the newly synthesized RNA strand.
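The base-pairing rule behind these steps can be sketched in a few lines. This is only a toy illustration: the sequences are made up, and the names `DNA_TO_RNA` and `transcribe` are hypothetical helpers rather than part of any library. It shows that RNA built against the antisense (template) strand ends up matching the sense strand, with U in place of T.

```python
# Pairing from a DNA template base to the RNA base added opposite it.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5: str) -> str:
    """Return the RNA (5'->3') made from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5.upper())


# Hypothetical toy sequences, not a real gene.
coding_5to3 = "ATGGCCTAA"    # sense strand, 5'->3'
template_3to5 = "TACCGGATT"  # antisense strand, 3'->5'

rna = transcribe(template_3to5)
print(rna)
```

The resulting RNA is identical to the sense strand apart from the T to U substitution, which is why gene sequences are usually reported as the sense strand.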
Eukaryotic mRNA Processing
- Nuclear mRNAs have a 5’ cap added
- Protects single-stranded mRNA from degradation
- Regulates nuclear export
- Promotes translation into protein
- mRNAs are polyadenylated at the 3’ end
- Also protects from degradation
- Aids in transcription termination, export and translation
- Introns are spliced out as required
Alternate Transcripts and Isoforms
Transcriptome Resources
- Reference Transcriptomes & Genomes are now commonly available
- Incorporate experimentally derived & predicted sequences + loci
- GENCODE provides the highest-quality annotations for mouse & human
- Release 48 (GRCh38): 78,686 genes + 385,669 transcripts
- Other organisms from Ensembl, RefSeq, UCSC etc
- Zebrafish, Rat, Chicken, Drosophila, Wheat, Yeast, E. coli etc
- Sometimes we build novel transcriptomes from specific tissues
- e.g. sea snake venom gland, shiraz fruit
Early Transcriptomics
Northern Blotting
- Northern blot (Alwine, Kemp, and Stark 1977) extended DNA-based methods (i.e. Southern blot) \(\implies\) Earliest single-gene method
- Gel Electrophoresis then hybridisation with labelled probe
- Requires some knowledge of RNA sequence
- Informative for Presence/Absence calls
- Images scanned \(\rightarrow\) Densitometric Analysis for crude quantitation
- Possible for some different isoforms to be detected
- Sequence dependent
RT-qPCR
The CT value is actually estimated as a decimal (non-integer) value
- “Gold-standard” for measurement of transcription levels
- Single gene \(\implies\) not a high-throughput technique
- Targets a single transcript region with specific primers to produce cDNA
\(\rightarrow\) Polymerase Chain Reaction (PCR)
- Each PCR cycle approximately doubles the target region
- cDNA produced is identified using fluorophores
- Fluorescence doubles with each cycle
- Once fluorescence passes a detection threshold, the cycle number is recorded
- Known as the Cycle Threshold (CT) value
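A minimal sketch of how a fractional CT value can arise, assuming an idealised curve in which fluorescence exactly doubles every cycle with no plateau. The function name `fractional_ct`, the starting signals and the threshold are all hypothetical; real instruments use their own curve-fitting methods.

```python
import math


def fractional_ct(fluorescence, threshold):
    """Interpolate, on the log2 scale, the cycle where signal crosses threshold.

    fluorescence[i] is the reading after cycle i + 1.
    Returns None if the threshold is never crossed.
    """
    for i in range(1, len(fluorescence)):
        lo, hi = fluorescence[i - 1], fluorescence[i]
        if lo < threshold <= hi:
            frac = (math.log2(threshold) - math.log2(lo)) / (
                math.log2(hi) - math.log2(lo)
            )
            return i + frac  # lo is cycle i, hi is cycle i + 1
    return None


# Assumed toy values: sample B starts with 4x the template of sample A.
start_a = 1e-6
start_b = 4 * start_a
curve_a = [start_a * 2**cycle for cycle in range(1, 41)]
curve_b = [start_b * 2**cycle for cycle in range(1, 41)]

ct_a = fractional_ct(curve_a, threshold=0.1)
ct_b = fractional_ct(curve_b, threshold=0.1)
print(ct_a, ct_b, ct_a - ct_b)
```

Under perfect doubling, a 4-fold difference in starting template corresponds to a CT difference of two cycles, so lower CT values mean higher starting abundance.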
Sanger Sequencing
SAGE & CAGE
- First high-throughput quantification method was Serial Analysis of Gene Expression (SAGE) (Velculescu et al. 1995)
- mRNA \(\rightarrow\) cDNA using biotinylated primers
- cDNA bound to beads (using biotin) & cleaved
- 11mer “tags” were ligated into long sequences using linker sequences
- Sequenced using Sanger Sequencing
- Deconvolution & counting
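The deconvolution and counting step can be sketched as below. This is a deliberate simplification: it assumes single tags separated by a fixed `CATG` spacer and toy tag sequences, whereas the real protocol is more involved. `SPACER` and `count_tags` are hypothetical names.

```python
from collections import Counter

SPACER = "CATG"  # assumed separator between tags in the sequenced concatemer


def count_tags(concatemer, spacer=SPACER, tag_len=11):
    """Split a concatemer on the spacer and count each tag of the expected length."""
    tags = [t for t in concatemer.split(spacer) if len(t) == tag_len]
    return Counter(tags)


# Toy 11mer tags (made up); each would map back to a transcript via a lookup table.
tag_a = "AAACCCGGGTT"
tag_b = "TTTGGGAAACC"
concatemer = SPACER.join([tag_a, tag_b, tag_a])

counts = count_tags(concatemer)
print(counts)
```

The count for each tag is then taken as the estimate of abundance for the transcript it came from.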
Microarray Technology
- My search last week showed 69,000 public microarray datasets in the GEO database
- I reviewed a Scientific Reports submission using public array data last month
- Microarrays represent the birth of modern transcriptomics
- Thousands of genes could be measured simultaneously!!!
- Tens of thousands of public datasets \(\implies\) still being mined
- Established during the latter stages of the Human Genome Project (1990-2003)
- Databases & complete reference sequences became widely available
- All require fluorescently labelled cDNA copies of RNA
- Hybridised to the array using probes for known sequences
- \(\uparrow\) fluorescence \(\implies \uparrow\) RNA abundance
Two Colour Arrays
- Two colour microarrays were printed onto microscope slides
- Known probe sequences were printed to the surface in defined locations
- 60-75mer oligonucleotide probes
- Highly customisable by project
- Two samples per array
- Samples labelled with Cy5 (Red) or Cy3 (Green)
- Scanned at 570nm (Cy3) and 670nm (Cy5)
MA Plots
- Difference in log signal (the log-ratio)
\(M = \log_2\left(\frac{R}{G}\right) = \log_2(R) - \log_2(G)\)
- Average Signal
\(A = \frac{1}{2}\log_2(RG) = \frac{\log_2(R) + \log_2(G)}{2}\)
- Assess bias within and between arrays
- Also to show DE genes
- Term “MA Plot” still used in RNA-Seq despite no connection to formula
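As a quick worked example of the two formulas, the sketch below computes \(M\) and \(A\) for a single spot. The intensities are made-up values and `ma_values` is a hypothetical helper name.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    m = math.log2(red) - math.log2(green)        # log-ratio
    a = (math.log2(red) + math.log2(green)) / 2  # average log signal
    return m, a


# Assumed toy intensities: red channel is 4x brighter than green.
m, a = ma_values(red=1024, green=256)
print(m, a)
```

A spot with equal signal in both channels gives \(M = 0\), so a cloud of points drifting away from zero across the range of \(A\) suggests an intensity-dependent dye bias.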
Single Channel Arrays
- Affymetrix Arrays became dominant
- Factory manufactured
- Standardised layout for each organism
- Single sample per array
- Only scanned at one frequency
\(\implies\) no dye bias
- More genes/array
- 25mer probes targeting 3’ end of transcript
- Captured only intact transcripts
3’ Arrays
- Each 3’ exon targeted by 11 unique 25mer probes \(\implies\) a probeset
- Possible to detect different transcripts only if 3’ exons differ
- Perfect Match (PM) probes \(\implies\) exactly match the target sequence
- Known to capture off-target signal \(\implies\) non-specific binding (NSB)
- 3’ arrays include paired mismatch probes (MM) with a change at the 13th position
- Literally half the array
- Intended to quantify NSB properties of each probe
- Sometimes returned more signal than PM probes 🤪
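The PM/MM pairing can be illustrated with a short sketch. It assumes the change at the 13th position is a swap to the complementary base, and uses a made-up 25mer; `mismatch_probe` is a hypothetical helper.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(pm):
    """Return the MM partner of a 25mer PM probe (assumes a complement swap at base 13)."""
    if len(pm) != 25:
        raise ValueError("expected a 25mer probe")
    middle = 12  # 13th position, using 0-based indexing
    return pm[:middle] + COMPLEMENT[pm[middle]] + pm[middle + 1:]


pm = "ACGTACGTACGTACGTACGTACGTA"  # toy sequence, not a real probe
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

Because the two probes differ by a single central base, the MM probe was expected to bind the true target poorly while keeping similar non-specific binding behaviour.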
Whole Transcript Arrays
- Whole Transcript Arrays released by Affymetrix in the mid-2000s
- Marketed as Exon Arrays and Gene Arrays
- Probes along entire transcript BUT \(\leq\) 4 probes/exon
Microarrays Vs RNA-Seq
- Exon Arrays released in 2006 (dashed line)
- Publications usually lag purchasing by 2-3 years
- Microarray publications peaked in 2008 (i.e. purchases peaked around 2005-6)
- Affymetrix owned by Thermo-Fisher since 2016
- Microarrays continue to be extensively used in DNA-methylation analysis
Microarray Analysis
Single Channel Data
- We will have multiple arrays from each condition
- Biological Replicates (hopefully \(\geq4\) per condition)
- Want to find changed expression in response to our biological hypothesis
- Will some arrays have higher/lower overall signal?
- Pipetting errors, hybridisation variability etc
- Two Initial Problems to solve
- Adjust for overall differences in signal \(\implies\) Normalisation
- Removal of Background Signal (non-specific binding + optical noise)
\(\implies\) Background Correction
Normalisation
- The variation here is primarily technical
\(\implies\) not due to biology
- Higher variance reduces power of statistical testing
- Can we reduce this?
- Quantile normalisation
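A minimal sketch of quantile normalisation on toy data, written with the standard library only. It assumes no tied values within an array (real implementations handle ties more carefully), and `quantile_normalise` is a hypothetical name.

```python
def quantile_normalise(arrays):
    """Force every array to share the same distribution of values.

    arrays: list of equal-length lists, one list of signal values per array.
    """
    n_values = len(arrays[0])
    sorted_arrays = [sorted(values) for values in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n_values)
    ]
    normalised = []
    for values in arrays:
        order = sorted(range(n_values), key=lambda i: values[i])
        out = [0.0] * n_values
        for rank, index in enumerate(order):
            out[index] = reference[rank]  # replace each value by its rank's reference
        normalised.append(out)
    return normalised


# Toy log2 signals for three probes on two arrays (made-up numbers).
raw = [[2.0, 4.0, 6.0], [3.0, 1.0, 5.0]]
norm = quantile_normalise(raw)
print(norm)
```

After normalisation every array contains exactly the same set of values, and only the ranking of probes within each array is retained.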
Background Correction
- Background Correction performed simultaneously with estimation of signal
- Robust Multichip Average (RMA) (Irizarry et al. 2003)
- Estimates signal for each array (\(\mu_i\))
- Model includes probe affinities (\(\alpha_j\))
- Doesn’t include MM probes
- Fitted using robust statistics to reduce impact of outlier probes
\[ \log_2 PM_{ij} = \mu_i + \alpha_j + \epsilon_{ij} \]
- Extended to GC-RMA (Wu et al. 2004) to include GC content of probes
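One robust way to fit an additive model of this form is median polish, sketched below for a single probeset. This covers only the fitting of \(\mu_i\) and \(\alpha_j\), not background correction, and is not the reference implementation. The data are made up, and the constraint that the probe affinities have median zero is an assumption added here so the two sets of parameters are identifiable.

```python
from statistics import median


def median_polish(log_pm, n_iter=10):
    """Fit log_pm[i][j] = mu[i] + alpha[j] + residual using medians.

    log_pm[i][j] is the log2 PM signal for array i and probe j.
    """
    n_arrays, n_probes = len(log_pm), len(log_pm[0])
    resid = [row[:] for row in log_pm]
    mu = [0.0] * n_arrays
    alpha = [0.0] * n_probes
    for _ in range(n_iter):
        # Sweep the median of each array (row) into mu.
        for i in range(n_arrays):
            m = median(resid[i])
            mu[i] += m
            resid[i] = [r - m for r in resid[i]]
        # Sweep the median of each probe (column) into alpha.
        for j in range(n_probes):
            m = median([resid[i][j] for i in range(n_arrays)])
            alpha[j] += m
            for i in range(n_arrays):
                resid[i][j] -= m
    # Assumed constraint: centre alpha on a median of zero.
    shift = median(alpha)
    alpha = [a - shift for a in alpha]
    mu = [m + shift for m in mu]
    return mu, alpha


# Toy probeset: 3 arrays x 3 probes, with one outlier probe on the first array.
log_pm = [[7.0, 8.0, 14.0], [9.0, 10.0, 11.0], [11.0, 12.0, 13.0]]
mu, alpha = median_polish(log_pm)
print(mu, alpha)
```

Without the outlier, the first array would read 7, 8, 9. The fit still returns a signal of 8 for that array, which is the sense in which robust fitting reduces the impact of outlier probes.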
Differential Expression Analysis
- A primary challenge is to detect where gene expression levels change in response to the biological question
- Often control samples Vs treated samples
- Microarray data is approximately normally distributed on the log2 scale
- Can fit standard regression models
\[ H_0: \text{No difference in average gene expression levels}\\ H_A: \text{Some difference in average gene expression levels} \]
- NB: Experiments estimate the true expression level across a theoretical population
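For a single gene in a control vs treated design, the test statistic for these hypotheses can be sketched as an ordinary pooled-variance \(t\)-statistic. The values below are made up, `two_group_t` is a hypothetical helper, and no \(p\)-value is computed because the standard library has no \(t\)-distribution. Dedicated packages also moderate the variances, which this sketch does not attempt.

```python
import math
from statistics import mean, variance


def two_group_t(control, treated):
    """Return (logFC, t) for one gene from log2 expression values in two groups."""
    n1, n2 = len(control), len(treated)
    log_fc = mean(treated) - mean(control)
    pooled_var = (
        (n1 - 1) * variance(control) + (n2 - 1) * variance(treated)
    ) / (n1 + n2 - 2)
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return log_fc, log_fc / std_err


# Toy log2 expression values, 4 replicates per condition (assumed numbers).
control = [7.0, 7.2, 6.8, 7.0]
treated = [9.0, 9.2, 8.8, 9.0]

log_fc, t_stat = two_group_t(control, treated)
print(log_fc, t_stat)
```

A logFC of 2 on the log2 scale corresponds to a 4-fold increase in expression, and the size of \(t\) reflects how large that change is relative to the replicate-to-replicate variability.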
Example Results
Gene      logFC  AveExpr      t   P.Value  adj.P.Val      B
Gene 1    2.180   1.0834   7.96  4.78e-05    0.00478   2.52
Gene 34   1.361   0.1829   3.12  1.44e-02    0.53387  -3.47
Gene 2    1.980   1.3416   2.90  2.02e-02    0.53387  -3.82
Gene 72   1.127  -0.0684   2.86  2.15e-02    0.53387  -3.88
Gene 75  -0.524  -0.1168  -2.72  2.67e-02    0.53387  -4.10
Gene 86   0.771  -0.1801   2.52  3.60e-02    0.59375  -4.41
Gene 16  -0.501  -0.0495  -2.43  4.16e-02    0.59375  -4.55
Gene 83   0.453  -0.1065   2.27  5.35e-02    0.65833  -4.80
Gene 33  -0.932  -0.5752  -2.15  6.39e-02    0.65833  -4.97
Gene 89   0.467  -0.1272   2.13  6.58e-02    0.65833  -5.00
Taken from the limma help page examples
Volcano Plots
- Show estimates of logFC against significance (\(-\log_{10} p\))
- Example is actually taken from RNA-Seq, but the plots are constructed identically
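The coordinates of a volcano plot are simple to compute. The sketch below uses three rows from the example results table (logFC and P.Value) and derives the \(y\)-axis value for each. Plotting itself is left out, and the variable names are arbitrary.

```python
import math

# (gene, logFC, P.Value) taken from the example results table.
results = [
    ("Gene 1", 2.180, 4.78e-05),
    ("Gene 34", 1.361, 1.44e-02),
    ("Gene 75", -0.524, 2.67e-02),
]

# x = logFC, y = -log10(p): small p-values end up high on the plot.
points = [(gene, log_fc, -math.log10(p)) for gene, log_fc, p in results]

for gene, x, y in points:
    print(f"{gene}: x = {x:.3f}, y = {y:.2f}")
```

Genes with both a large change and strong evidence sit in the upper left and upper right corners, which gives the plot its name.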
Enrichment Testing
- Often want to identify key pathways associated with results
- Are genes from specific pathways or gene-sets most impacted?
- Gene Ontology (GO) Terms:
- Carefully constructed, hierarchical database of terms
- Biological Process, Molecular Function and Cellular Component
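One common way to test whether a gene-set is over-represented among the DE genes is a hypergeometric (one-sided Fisher) test, sketched below. This is an assumed choice of method for illustration, with made-up counts; `enrichment_p` is a hypothetical helper.

```python
from math import comb


def enrichment_p(n_universe, n_in_set, n_de, n_de_in_set):
    """P(at least n_de_in_set DE genes fall in the set) under random sampling."""
    total = comb(n_universe, n_de)
    upper = min(n_in_set, n_de)
    favourable = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_de - k)
        for k in range(n_de_in_set, upper + 1)
    )
    return favourable / total


# Toy numbers: 1000 genes tested, 50 in the gene-set, 100 DE, 20 DE genes in the set.
p_enriched = enrichment_p(n_universe=1000, n_in_set=50, n_de=100, n_de_in_set=20)
print(p_enriched)
```

With these numbers only about 5 of the DE genes would be expected in the set by chance, so observing 20 gives a very small \(p\)-value.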
Closing Summary
Closing Comments
- Microarray signal estimates follow a normal distribution (\(\mathcal{N}(\mu, \sigma)\)) when log-transformed
- We can apply linear regression models
- In a simple A vs B experiment \(\equiv\) \(t\)-tests
- Estimates of change from DE analysis often referred to as logFC (log fold-change)
- Variances can be moderated for improved performance
- \(p\)-values are usually FDR-adjusted
- Gives best compromise of power vs error-rate control
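A minimal sketch of FDR adjustment, assuming the Benjamini-Hochberg procedure (the slides do not name a specific method). The \(p\)-values are made up and `bh_adjust` is a hypothetical helper.

```python
def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices from smallest p
    adjusted = [0.0] * n
    running_min = 1.0
    # Work from the largest p-value down so adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvals[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.005]  # toy p-values
adj_p = bh_adjust(raw_p)
print(adj_p)
```

The smallest raw \(p\)-value is multiplied by the number of tests, which matches the top row of the example results table: \(4.78 \times 10^{-5}\) becomes 0.00478, consistent with 100 genes having been tested.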