Transcriptomics: Lecture 2
Frontiers of Biotechnology: Bioinformatics and Systems Modelling
The University of Adelaide
I’d like to acknowledge the Kaurna people as the traditional owners and custodians of the land we know today as the Adelaide Plains, where I live & work.
I also acknowledge the deep feelings of attachment and relationship of the Kaurna people to their place.
I pay my respects to the cultural authority of Aboriginal and Torres Strait Islander peoples from other areas of Australia, pay my respects to Elders past, present and emerging, and acknowledge any Aboriginal Australians who may be with us today.
RNA-Seq
RNA Sequencing
According to Wang, Gerstein, and Snyder (2009)
RNA-Seq, also called RNA sequencing, is a particular technology-based sequencing technique which uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, analyzing the continuously changing cellular transcriptome.
The RNA Population Of a Eukaryotic Cell
The Key Steps
- Focus from here on will be sequencing mRNA using short reads
- Library Preparation
- RNA Quality assessment (i.e. RNA degradation)
- Selecting target molecules
- Adding sequencing primers
- Sequencing
- Alignment + Quantitation
- DE Gene Detection
- Downstream Analysis
- (Optional) Nobel Prize
RNA Selection
- Select for poly-adenylated RNA using oligo-dT-based methods
- Only extracts intact mRNA with a polyA tail (includes some ncRNA)
Library Preparation
- RNA is then fragmented and size selected (200-300nt)
- Very short transcripts always lost during this step
- cDNA produced
- Sequencing adapters added
- Indexes are unique to each individual library \(\implies\) always have replicates
- Optionally contain Unique Molecular Identifiers (UMI) \(\implies\) Helps identify PCR duplicates
- Most RNA-Seq now retains strand-of-origin information (Stranded RNA-Seq)
- During PCR only the first cDNA template retained
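The role of UMIs in spotting PCR duplicates can be illustrated with a minimal sketch. It assumes reads have already been aligned and that each read carries its UMI; the dictionary keys (`chrom`, `start`, `strand`, `umi`, `name`) are hypothetical and not taken from any particular tool. Reads sharing both alignment position and UMI are treated as copies of one original molecule. Real deduplication tools also allow for sequencing errors within the UMI, which this sketch ignores.

```python
from collections import defaultdict


def deduplicate_by_umi(reads):
    """Keep one read per (chromosome, start, strand, UMI) combination."""
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"], read["umi"])
        groups[key].append(read)
    # Keep the first read seen in each group; the rest are treated as PCR duplicates
    return [group[0] for group in groups.values()]


# Hypothetical example reads
reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "strand": "+", "umi": "ACGT"},
    {"name": "r2", "chrom": "chr1", "start": 100, "strand": "+", "umi": "ACGT"},  # same position + UMI as r1
    {"name": "r3", "chrom": "chr1", "start": 100, "strand": "+", "umi": "TTAG"},  # same position, different UMI
    {"name": "r4", "chrom": "chr1", "start": 250, "strand": "-", "umi": "ACGT"},  # same UMI, different position
]

unique_reads = deduplicate_by_umi(reads)
print([r["name"] for r in unique_reads])  # ['r1', 'r3', 'r4']
```

Without UMIs, r1, r2 and r3 would be indistinguishable by position alone, so true biological copies of a highly expressed transcript could be wrongly discarded as duplicates.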
Sequencing
Alignment and Quantitation
Genomic Alignment
- Alignment to a reference genome requires a splice-aware aligner
- A GTF (Gene Transfer Format) file is required when building the index \(\implies\) Provides all exon-transcript-gene co-ordinates
- New Gencode, Ensembl etc releases at regular intervals
- Most common aligners are STAR (Dobin et al. 2013) & hisat2 (Kim et al. 2019)
- Return alignments as a bam file
- Aligned reads are then counted to provide gene-level counts
- htseq (Anders, Pyl, and Huber 2014) and featureCounts (Liao, Smyth, and Shi 2014) are very common
- The same GTF should be used as during indexing
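To make the exon-transcript-gene co-ordinates in a GTF more concrete, here is a minimal sketch of parsing a single GTF record. A GTF line has nine tab-separated columns, with the last column holding `key "value";` attribute pairs. The example line and its identifiers (`GENE0001`, `TX0001`) are invented for illustration, and the function name `parse_gtf_line` is hypothetical. Attribute keys that appear more than once on a line would overwrite each other in this simplified version.

```python
def parse_gtf_line(line):
    """Parse one GTF record into a dictionary (simplified sketch)."""
    fields = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    attrs = {}
    for item in attributes.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        # Each attribute looks like: gene_id "GENE0001"
        key, _, value = item.partition(" ")
        attrs[key] = value.strip('"')
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),  # GTF co-ordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attrs,
    }


# Invented example record, not taken from a real annotation release
example = (
    "chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\t"
    'gene_id "GENE0001"; transcript_id "TX0001"; exon_number 1;'
)
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"], record["attributes"]["gene_id"])
```

Because the gene and transcript identifiers live in the attributes column, using a different annotation release for counting than for indexing can silently change which co-ordinates belong to which gene.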
Counting Alignments
- Some alignments align beautifully within exon structures
- Some overhang a little
- Unspliced mRNA?
- Some genes are overlapping
- Stranded libraries can resolve
- Maybe bacterial reads span genes within an operon
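The effect of overlapping genes, and how strand information can resolve them, can be sketched with a toy counting function. This is not how htseq or featureCounts are implemented; the gene co-ordinates, the function name `assign_alignment` and the category labels are all hypothetical. The sketch also assumes the strand of each alignment already matches the strand of the transcript it came from, which depends on the library protocol in practice.

```python
from collections import Counter

# Hypothetical genes: (gene_id, chrom, start, end, strand), 1-based inclusive
genes = [
    ("geneA", "chr1", 100, 500, "+"),
    ("geneB", "chr1", 400, 900, "-"),  # overlaps geneA on the opposite strand
]


def assign_alignment(chrom, start, end, strand, genes, stranded=True):
    """Return the gene an alignment overlaps, or 'ambiguous' / 'no_feature'."""
    hits = [
        gene_id
        for gene_id, g_chrom, g_start, g_end, g_strand in genes
        if g_chrom == chrom
        and start <= g_end
        and end >= g_start
        and (not stranded or g_strand == strand)
    ]
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"
    return hits[0]


alignments = [
    ("chr1", 150, 250, "+"),   # within geneA only
    ("chr1", 420, 480, "+"),   # in the overlap, + strand
    ("chr1", 420, 480, "-"),   # in the overlap, - strand
    ("chr1", 950, 1050, "+"),  # outside both genes
]

stranded_counts = Counter(assign_alignment(*a, genes) for a in alignments)
unstranded_counts = Counter(assign_alignment(*a, genes, stranded=False) for a in alignments)
print(dict(stranded_counts))
print(dict(unstranded_counts))
```

With strand information, both alignments in the overlapping region are assigned to a gene; without it, they are both counted as ambiguous and are lost from the gene-level counts.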
Gene-Level Counts
- The region encoding a gene is (relatively) well defined
- An alignment within a gene is easy to assign to that gene
- Much more difficult to identify which transcript it came from
- Many transcripts share multiple exons
- Splice Junctions were the earliest approach
Transcriptome Alignment
- An alternative is to provide a reference transcriptome
- Alignments no longer need to be splice aware
- Reads can (& commonly do) align to multiple transcripts
- Much faster than traditional alignment
- Pseudo-alignment is used by kallisto (Bray et al. 2016)
- Statistically modelled expression estimates used by salmon (Patro et al. 2017)
- Return transcript-level counts without bam files
- Add transcript-level counts \(\implies\) gene-level counts
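Adding transcript-level counts to obtain gene-level counts can be sketched in a few lines. The transcript-to-gene mapping and the count values below are invented for illustration, and the function name `summarise_to_gene` is hypothetical. In practice the mapping would come from the same annotation used to build the transcriptome.

```python
from collections import defaultdict

# Hypothetical transcript-to-gene mapping and transcript-level counts
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
transcript_counts = {"tx1": 120.5, "tx2": 30.25, "tx3": 45.0}


def summarise_to_gene(transcript_counts, tx2gene):
    """Sum transcript-level counts within each gene."""
    gene_counts = defaultdict(float)
    for tx, count in transcript_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)


gene_counts = summarise_to_gene(transcript_counts, tx2gene)
print(gene_counts)  # {'geneA': 150.75, 'geneB': 45.0}
```

Note that the transcript-level values are not necessarily integers, which is one reason they are often described as estimated counts.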
Pseudo-Counts
- Salmon counts are actually pseudo counts output by model fitting
- Predicts the proportion of library derived from transcript
- Fitted using EM-algorithm or Bayesian modelling
- Counts bootstrapped to provide uncertainty estimates of prediction \(\implies\) Measures how confident we are in the transcript-level counts
- Transcript-level counts can be scaled by uncertainty estimate (Baldoni et al. 2024) when performing differential transcript expression (DTE) analysis
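The idea of estimating the proportion of the library derived from each transcript with an EM algorithm can be shown with a toy example. This is a deliberately simplified sketch and not salmon's actual model: it ignores transcript length, fragment bias and alignment probabilities. The function name `em_abundance` and the example reads are hypothetical. Each read is represented only by the set of transcripts it is compatible with.

```python
def em_abundance(compat, n_transcripts, n_iter=200):
    """Estimate transcript proportions from read-to-transcript compatibility.

    compat: list of tuples, each holding the transcript indices a read matches.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # start from equal proportions
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: share each read between its compatible transcripts,
        # in proportion to the current abundance estimates
        for txs in compat:
            total = sum(theta[t] for t in txs)
            for t in txs:
                expected[t] += theta[t] / total
        # M-step: update proportions from the expected read assignments
        theta = [e / len(compat) for e in expected]
    return theta


# Toy data: 6 reads match both transcripts, 3 match only tx0, 1 matches only tx1
compat = [(0, 1)] * 6 + [(0,)] * 3 + [(1,)] * 1
theta = em_abundance(compat, n_transcripts=2)
pseudo_counts = [p * len(compat) for p in theta]
print([round(p, 3) for p in theta])          # [0.75, 0.25]
print([round(c, 2) for c in pseudo_counts])  # [7.5, 2.5]
```

The six shared reads are split 4.5 to 1.5 between the two transcripts, guided by the uniquely assigned reads. This is why the resulting values are fractional pseudo-counts rather than integer tallies.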
Differential Gene Expression Analysis
- A fundamental question:
Does the overall abundance of a gene differ between experimental conditions?
- Need a statistical approach to answer this question
Count-Based Data
- Under both reference-types \(\rightarrow\) counts represent expression
- These are discrete data (i.e. not continuous values)
- Microarrays were continuous values (fluorescence intensity)
- Modelled using log2-transformed values \(\implies \mathcal{N}(\mu, \sigma)\)
- Linear regression, \(t\)-tests etc
- Mean and variance are independent variables
- Count data is commonly modelled using a Poisson Distribution \(\implies \text{Poisson}(\lambda)\)
- Poisson variance is defined as being equal to the mean i.e. \(\sigma^2 = \mu\) \(\implies\) Mean and variance are not independent variables
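The relationship \(\sigma^2 = \mu\) for the Poisson distribution can be checked numerically. The sketch below computes the mean and variance directly from the probability mass function for a few values of \(\lambda\); the function names and the truncation point `k_max` are choices made here for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability P(X = k), computed on the log scale for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_mean_var(lam, k_max=400):
    """Mean and variance obtained by summing over k = 0..k_max."""
    probs = [poisson_pmf(k, lam) for k in range(k_max + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


for lam in (2.0, 10.0, 50.0):
    mean, var = poisson_mean_var(lam)
    print(f"lambda={lam}: mean={mean:.4f}, variance={var:.4f}")
```

For each \(\lambda\) the computed mean and variance agree, so a gene with higher expression is automatically expected to show a larger variance in its counts.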
Enrichment Testing
- Often want to identify key pathways associated with results
- Are genes from specific pathways or gene-sets most impacted?
- Gene Ontology (GO) Terms:
- Carefully constructed, hierarchical database of terms
- Biological Process, Molecular Function and Cellular Component
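One common way to ask whether a gene-set is over-represented among the differentially expressed (DE) genes is a hypergeometric test. The slides do not name a specific test, so this is offered as an illustrative assumption rather than the method used in the course. The function name `hypergeom_enrichment_p` and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_set, n_de, n_overlap):
    """P(overlap >= n_overlap) if n_de genes were drawn at random.

    n_universe: total genes tested
    n_in_set:   genes in the gene-set (e.g. one GO term)
    n_de:       genes called differentially expressed
    n_overlap:  DE genes that are also in the gene-set
    """
    total = comb(n_universe, n_de)
    upper = min(n_in_set, n_de)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Small hypothetical example: 20 genes tested, 5 in the gene-set,
# 4 DE genes, 2 of which are in the gene-set
p_value = hypergeom_enrichment_p(n_universe=20, n_in_set=5, n_de=4, n_overlap=2)
print(round(p_value, 4))  # 0.2487
```

In a real analysis this test would be repeated for every gene-set, so the resulting p-values need adjusting for multiple testing.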
Beyond Differential Expression
Transcript Assembly
StringTie is the quick & dirty method. Will also turn up some weird artefacts
- With a good reference genome \(\implies\) StringTie can identify novel transcripts
- Un-annotated genes/lncRNA
Long Read Technology
- Most transcriptome assemblies performed using Trinity/StringTie
- Long Reads are now becoming the dominant platform
- Oxford Nanopore (ONT) reads from 50 bp to >4 Mb
- Pacific Biosciences (PacBio) up to 25kb
- Illumina maxes out around 2x150nt
- Both ONT and PacBio can sequence complete transcripts!
High Resolution Technologies
Single-Cell Transcriptomics
- Single-Cell RNA-Seq is becoming a dominant transcriptomics platform
- Conventional RNA-Seq is now sometimes called bulk RNA-Seq
- Enables insights into highly heterogeneous samples (e.g. immune cells)
- Originally polyA-selected 3’ sequences \(\rightarrow\) full-length
- Incomplete transcript capture \(\implies\) many missing genes in each cell
Spatial Transcriptomics
- Spatial Transcriptomics uses tissue slices on a microscope slide
- Each spot on the slide contains oligo-dT probes with a unique spatial barcode
- Spatial barcode helps assign reads to a tissue region
- Interaction between key cell-types can be identified
- Some techniques use Fluorescence In Situ Hybridisation (FISH) to detect transcripts
- Cell boundaries detected in combination with classic stains
- Transcript location within the cell
- Can also cluster cell-types like scRNA then map back to tissue
- Multiple tissue layers possible \(\rightarrow\) $$$
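For the barcode-based approach described above, assigning reads to a tissue region amounts to looking up each read's spatial barcode. The sketch below is a minimal illustration under that assumption: the barcodes, spot co-ordinates, gene names and the function name `count_by_spot` are all invented, and real pipelines also correct sequencing errors in the barcodes.

```python
from collections import Counter

# Hypothetical lookup from spatial barcode to (row, column) of the spot on the slide
barcode_to_spot = {"AAAC": (0, 0), "AAAG": (0, 1), "AACT": (1, 0)}

# Hypothetical reads as (spatial barcode, gene the read was assigned to)
reads = [
    ("AAAC", "geneA"),
    ("AAAC", "geneA"),
    ("AAAG", "geneB"),
    ("TTTT", "geneA"),  # barcode not on the slide
    ("AACT", "geneA"),
]


def count_by_spot(reads, barcode_to_spot):
    """Count reads per (spot, gene); reads with unknown barcodes are set aside."""
    counts = Counter()
    unassigned = 0
    for barcode, gene in reads:
        spot = barcode_to_spot.get(barcode)
        if spot is None:
            unassigned += 1
            continue
        counts[(spot, gene)] += 1
    return counts, unassigned


spot_counts, n_unassigned = count_by_spot(reads, barcode_to_spot)
print(dict(spot_counts))
print(n_unassigned)  # 1
```

The result is a count table indexed by spot and gene, which can then be overlaid on the tissue image using the spot co-ordinates.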
References
Footnotes
https://bionumbers.hms.harvard.edu/bionumber.aspx?s=n&v=5&id=100264