Transcriptomics: Lecture 1

Biotech 7005/Bioinf 3000
Frontiers of Biotechnology: Bioinformatics and Systems Modelling
The University of Adelaide

Author: Dr Stevie Pederson (They/Them), stevie.pederson@thekids.org.au

Affiliation: Black Ochre Data Labs, The Kids Research Institute Australia

Published: September 1, 2025

Welcome To Country

I’d like to acknowledge the Kaurna people as the traditional owners and custodians of the land we know today as the Adelaide Plains, where I live & work.

I also acknowledge the deep feelings of attachment and relationship of the Kaurna people to their place.

I pay my respects to the cultural authority of Aboriginal and Torres Strait Islander peoples from other areas of Australia, and pay my respects to Elders past, present and emerging, and acknowledge any Aboriginal Australians who may be with us today.

Introduction To Transcriptomics

Introduction

Postdoctoral Bioinformatician, Black Ochre Data Labs, Adelaide
Working in collaboration with members of the SA Aboriginal community
Multi-omics project to identify and address the underlying causes of high type 2 diabetes (T2D) rates and complications
- Using genomics, epigenomics, transcriptomics and other layers
- My focus is on the transcriptomics layer

Why Transcriptomics?

DNA can be described as being like a giant book of instructions

Some regions are defined as genes
- Originally considered to be the basic unit of inheritance
- Now commonly used to describe a region of DNA transcribed into RNA
- Recent discovery of enhancer-RNA (eRNA) muddies the water a little

By Thomas Shafee - Own work, CC BY 4.0, via Wikimedia Commons

Why Transcriptomics?

DNA \(\rightarrow\) mRNA \(\rightarrow\) Proteins
- Commonly referred to as the Central Dogma of Molecular Biology
Proteins are the workhorses of the cell & body
- Do most of the work, and are responsible for most of the structure
- Examples like keratin (hair), haemoglobin (oxygen transport) etc

ncRNAs are also highly functional
- Ribosomal RNA (rRNA) + tRNA \(\rightarrow\) translation from mRNA to Protein
- microRNAs play a role in gene-regulation via mRNA stability

Why Transcriptomics?

Most RNA is single-stranded but can have extremely complex structure
- Shown is a 2kb region from the lncRNA Xist (17kb in total)
Coats the entire X chromosome during X inactivation
Also interacts with the antisense lncRNA Tsix

Why Transcriptomics?

Definition

Based on Wang, Gerstein, and Snyder (2009)

The transcriptome can be defined as the complete set of (RNA) transcripts in a cell, or a population of cells, for a specific developmental stage or physiological condition

Transcriptomics is simply the study of the transcriptome
Can be the entire RNA content of a cell (or cells) or a subset of molecules (e.g. mRNA, miRNA)

Why Transcriptomics?

Is a snapshot of the dynamic biological processes associated with a biological question
Used to make inferences about these processes
- Identify therapeutic targets for Cardiovascular Disease
- Biomarkers for CAR-T cells
- Key drivers of correlated gene networks
- Early drivers of neurodegeneration in Alzheimer’s
Assumed to be low-level, i.e. upstream of the other molecular layers
- DNA \(\rightarrow\) RNA \(\rightarrow\) Protein \(\rightarrow\) Metabolites, Signalling molecules, etc …

Why Transcriptomics?

Early techniques often required large numbers of cells
- Often multiple cell types within a biological sample
Modern techniques are incredibly detailed
- Single-cell RNA-Seq characterises exact cell types and cell trajectories
- Spatial transcriptomics used to identify co-located cells in tissue
- Identify cell-cell signalling in situ

What Is Transcription

Definition

Transcription is the process of making an RNA copy of a gene sequence

Figure taken from ¹ Licensed under CC-BY 4.0 by OpenStax

Steps of Transcription

RNA polymerase binds to the promoter along with \(\geq1\) transcription factors

RNA polymerase creates a transcription bubble
- separates the two DNA strands, breaking hydrogen bonds between complementary DNA nucleotides.
RNA polymerase adds RNA nucleotides
- complementary to the antisense DNA strand.
RNA sugar-phosphate backbone forms
Hydrogen bonds of the RNA–DNA complex break freeing the newly synthesized RNA strand.
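
The base-pairing step above can be sketched in a few lines of Python. This is a toy illustration only: the function name `transcribe` and the example sequence are hypothetical, the template (antisense) strand is assumed to be written 3'→5', and promoters, termination and RNA processing are ignored.

```python
# Toy sketch: build the RNA complementary to the template (antisense) strand.
# Assumes the template is written 3'->5', so the RNA is produced 5'->3'.
PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3to5: str) -> str:
    """Return the RNA (5'->3') complementary to a template strand (3'->5')."""
    return "".join(PAIRING[base] for base in template_3to5.upper())

template = "TACGGATTC"       # hypothetical antisense strand, 3'->5'
rna = transcribe(template)   # same as the sense strand, with U in place of T
print(rna)
```

The resulting RNA has the same sequence as the sense (coding) strand, except that uracil replaces thymine.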

Steps of Transcription

If the cell is a eukaryotic cell

RNA processing
- This may include polyadenylation, capping and splicing
- Occurs during (or immediately after) transcription
RNA Localisation
- The RNA may remain in the nucleus or exit to the cytoplasm through the nuclear pore complex

Eukaryotic mRNA, miRNA & snRNA transcription uses RNA Polymerase II
- RNA Pol I: rRNA
- RNA Pol III: tRNA, 5S rRNA, some small RNAs

Eukaryotic mRNA Processing

Nuclear mRNAs have a 5’ cap added
- Protects single-stranded mRNA from degradation
- Regulates nuclear export
- Promotes translation into protein

mRNAs are polyadenylated at the 3’ end
- Also protects from degradation
- Aids in transcription termination, export and translation

Introns are spliced out as required

Eukaryotic mRNA Processing

Alternate Transcripts and Isoforms

Image by the National Human Genome Research Institute

Transcriptome Resources

Reference Transcriptomes & Genomes are now commonly available
- Incorporate experimentally derived & predicted sequences + loci
GENCODE² provides the highest-quality annotations for mouse & human
- Release 48 (GRCh38): 78,686 genes + 385,669 transcripts

Other organisms from Ensembl, RefSeq, UCSC etc
- Zebrafish, Rat, Chicken, Drosophila, Wheat, Yeast, E. coli etc

Sometimes we build novel transcriptomes from specific tissues
- e.g. sea snake venom gland, shiraz fruit

Early Transcriptomics

Northern Blotting

Northern blot (Alwine, Kemp, and Stark 1977) extended DNA-based methods (i.e. Southern blot) \(\implies\) Earliest single-gene method

Gel Electrophoresis then hybridisation with labelled probe
- Requires some knowledge of RNA sequence
Informative for Presence/Absence calls
- Images scanned \(\rightarrow\) Densitometric Analysis for crude quantitation
Possible for some different isoforms to be detected
- Sequence dependent

RT-qPCR

The C_T value is actually estimated as a decimal value

“Gold-standard” for measurement of transcription levels
- Single gene \(\implies\) not a high-throughput technique
Targets a single transcript region with specific primers to produce cDNA
\(\rightarrow\) Polymerase Chain Reaction (PCR)
Each PCR cycle approximately doubles the target region

cDNA produced is identified using fluorophores
- Fluorescence doubles with each cycle
Once fluorescence passes a detection threshold, the cycle number is recorded
- Known as the Cycle Threshold (C_T) value

RT-qPCR

Higher C_T values \(\implies\) lower numbers of target molecule at the beginning
These can be used to estimate and compare abundance levels (i.e. gene expression)

Is vulnerable to technical artefacts (e.g. pipetting & sample variability)
Often includes one or more “housekeeper” genes thought to be stably expressed

C_T values are normalised to the housekeeper genes \(\implies \Delta C_T\)
- C_T values are already on the log₂ scale (one cycle \(\approx\) one doubling): \(\Delta C_{T_g} = C_{T_g} - C_{T_{hk}}\)
Comparison between conditions is the change in \(\Delta C_T \implies \Delta\Delta C_T\)
Represents change on the log₂ scale, i.e. \(-\Delta\Delta C_T\) estimates the log fold-change
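
As a worked example, the \(\Delta\Delta C_T\) calculation might look like the sketch below. All C_T values are invented for illustration, the function name `delta_ct` is hypothetical, and the sketch assumes the usual convention that C_T values are treated directly as log₂-scale quantities (each cycle being one doubling), so that a lower C_T means more starting material.

```python
from statistics import mean

def delta_ct(ct_gene, ct_housekeeper):
    """Normalise each sample's C_T to its housekeeper C_T."""
    return [g - hk for g, hk in zip(ct_gene, ct_housekeeper)]

# Hypothetical C_T values: three replicates per condition
control = delta_ct([24.1, 23.9, 24.0], [18.0, 18.1, 17.9])
treated = delta_ct([22.0, 22.2, 22.1], [18.0, 17.9, 18.1])

# Change in delta-C_T between conditions
ddct = mean(treated) - mean(control)

# Lower C_T means higher abundance, so the sign is flipped for log2 fold-change
log2fc = -ddct
fold_change = 2 ** log2fc
print(f"ddCT = {ddct:.2f}, log2FC = {log2fc:.2f}, fold-change = {fold_change:.2f}")
```

In this made-up example the treated samples cross the threshold about two cycles earlier than controls, i.e. the target is roughly four-fold more abundant.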

Expressed Sequence Tags

The senior author on the EST paper was J. Craig Venter, who played an important role in the Human Genome Project

The first attempt at capturing the larger transcriptome was ESTs (Adams et al. 1991)
Identified 609 human brain mRNA sequences
- Selected for polyA-mRNA then reverse transcribed
- Used random primers \(\rightarrow\) Sanger Sequencing
Published 10 years before the first draft of the human genome (2001)
- Gene discovery was a hot topic

Sanger Sequencing

Estevezj, CC BY-SA 3.0, via Wikimedia Commons

SAGE & CAGE

First high-throughput quantification method was Serial Analysis of Gene Expression (SAGE) (Velculescu et al. 1995)

mRNA \(\rightarrow\) cDNA using biotinylated primers
cDNA bound to beads (using biotin) & cleaved
11mer “tags” were ligated into long sequences using linker sequences
Sequenced using Sanger Sequencing
Deconvolution & counting

Thomas Shafee, CC BY 4.0, via Wikimedia Commons

SAGE & CAGE

The terminology of counting tags is still used by some manuals & software
- Statistical models still form the basis of modern transcriptomics
Was described as Digital Gene Expression (DGE)
- The term DGE is still used but easily confused with Differential Gene Expression

A variant called Cap Analysis of Gene Expression (CAGE) targeted the 5’ Cap
Heavily used by FANTOM project (Abugessaisa et al. 2020) to identify exact Transcription Start Sites (TSS)

Microarray Technology

My search last week showed 69000 public microarray datasets in the GEO database
I reviewed a Scientific Reports submission using public array data last month

Microarrays represent the birth of modern transcriptomics
- Thousands of genes could be measured simultaneously!!!
- Tens of thousands of public datasets \(\implies\) still being mined
Established during latter stages of the Human Genome Project (1990-2003)
- Databases & complete reference sequences became widely available

All require fluorescently labelled cDNA copies of RNA
Hybridised to the array using probes for known sequences
- \(\uparrow\) fluorescence \(\implies \uparrow\) RNA abundance

Microarray Technology

All microarrays follow the same basic process

Image courtesy of Squidonius, Public domain, via Wikimedia Commons

Two Colour Arrays

Two colour microarrays were printed microscope slides
Known probe sequences were printed to the surface in defined locations
- 60-75mer oligonucleotide probes
- Highly customisable by project

Two samples per array
- Samples labelled with Cy5 (Red) or Cy3 (Green)
Scanned at 570nm (Cy3) and 670nm (Cy5)

Section of two-colour array taken from Shalon, Smith, and Brown (1996)

MA Plots

Log-ratio (difference of log intensities)
\(M = \log_2(\frac{R}{G}) = \log_2(R) - \log_2(G)\)
Average Signal
\(A = \frac{1}{2}\log_2(RG) = \frac{\log_2(R) + \log_2(G)}{2}\)

Assess bias within and between arrays
Also to show DE genes

Term “MA Plot” still used in RNA-Seq despite no connection to formula
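
The M and A values can be computed directly from the two channel intensities. The sketch below uses invented red/green intensities and a hypothetical helper `ma_values`; it only computes the coordinates and leaves the plotting to whatever library is preferred.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists from paired red (Cy5) and green (Cy3) intensities."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a

# Hypothetical spot intensities for three probes
red = [1024.0, 512.0, 256.0]
green = [256.0, 512.0, 1024.0]
m, a = ma_values(red, green)

for mi, ai in zip(m, a):
    print(f"M = {mi:+.1f}, A = {ai:.1f}")
```

A probe with equal signal in both channels sits at M = 0, so systematic departures from the M = 0 line across the range of A suggest dye or intensity-dependent bias.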

Single Channel Arrays

Affymetrix Arrays became dominant
- Factory manufactured
Standardised layout for each organism
Single sample per array
- Only scanned at one frequency
  \(\implies\) no dye bias
More genes/array

25mer probes targeting 3’ end of transcript
- Captured only intact transcripts

No author known. Schutz assumed based on copyright claims. CC BY-SA 3.0, via Wikimedia Commons

Single Channel Arrays

The basic methodology underpinning Affymetrix array design. Source Affymetrix

3’ Arrays

Each 3’ exon targeted by 11 unique 25mer probes \(\implies\) a probeset
Possible to detect different transcripts only if 3’ exons differ

Perfect Match (PM) probes \(\implies\) exactly matches target sequence
- Known to capture off-target signal \(\implies\) non-specific binding (NSB)
3’ arrays include paired mismatch probes (MM) with a change at the 13th position
- Literally half the array
- Intended to quantify NSB properties of each probe
- Sometimes returned more signal than PM probes 🤪

Whole Transcript Arrays

Whole Transcript Arrays released by Affymetrix in mid-2000s
- Marketed as Exon Arrays and Gene Arrays
Probes along entire transcript BUT \(\leq\) 4 probes/exon

Whole Transcript Arrays

Detecting alternate isoform usage on Exon Arrays was the focus of my PhD
My method was rubbish, but still better than anything else

No successful methods for determining alternate isoform usage
- Most people reverted back to gene-level signal
- No real gains over 3’ Arrays besides more genes/array

RNA-Seq appeared at a similar time & destroyed Affymetrix’s market share
- Alternate isoform usage in RNA-seq is still considered a bit exploratory

Microarrays Vs RNA-Seq

Exon Arrays released in 2006 (dashed line)
Publications usually lag purchasing by 2-3 years
- Microarray publications peaked in 2008 (i.e. purchasing peaked around 2005-6)
- Affymetrix owned by Thermo-Fisher since 2016
Microarrays continue to be extensively used in DNA-methylation analysis

Microarray Analysis

Single Channel Data

We will have multiple arrays from each condition
- Biological Replicates (hopefully \(\geq4\) per condition)
- Want to find changed expression in response to our biological hypothesis

Will some arrays have higher/lower overall signal?
- Pipetting errors, hybridisation variability etc
Two Initial Problems to solve

Adjust for overall differences in signal \(\implies\) Normalisation
Removal of Background Signal (non-specific binding + optical noise)
\(\implies\) Background Correction

Normalisation

Example of raw PM probe intensities. Taken from Bolstad et al. (2003)

The variation here is primarily technical
\(\implies\) not due to biology
Higher variance reduces power of statistical testing
Can we reduce this?

Quantile normalisation

Normalisation

Quantile normalisation is perfect for arrays with probes and probesets
- Normalise probe-level signal, but estimate gene expression at the probeset level
- Smooths out any normalisation artefacts

Select the lowest signal probe on each array
\(\rightarrow\) Likely to be a different probe on each array
Calculate the average signal across all arrays
Give each of the selected probes the average signal
Move to the next lowest signal probe until finished

Effectively randomises noise
Leads to arrays with identical distributions
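
The steps above can be sketched in plain Python. This is a minimal illustration, assuming every array has the same number of probes and ignoring special handling of ties; the function name `quantile_normalise` and the intensities are made up.

```python
def quantile_normalise(arrays):
    """arrays: one list of probe intensities per array, all the same length."""
    n_probes = len(arrays[0])
    # For each array, probe indices ordered from lowest to highest signal
    orders = [sorted(range(n_probes), key=arr.__getitem__) for arr in arrays]
    normalised = [[0.0] * n_probes for _ in arrays]
    for rank in range(n_probes):
        # Average the rank-th lowest signal across all arrays
        avg = sum(arr[order[rank]] for arr, order in zip(arrays, orders)) / len(arrays)
        # Give that average to whichever probe held this rank on each array
        for out, order in zip(normalised, orders):
            out[order[rank]] = avg
    return normalised

# Hypothetical intensities: 3 arrays x 4 probes
arrays = [[5.0, 2.0, 3.0, 4.0],
          [4.0, 1.0, 4.5, 2.0],
          [3.0, 4.0, 6.0, 8.0]]
normalised = quantile_normalise(arrays)
for arr in normalised:
    print([round(x, 2) for x in arr])
```

After normalisation every array contains exactly the same set of values, and only the probe that holds each value differs between arrays.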

Normalisation

Now we have identical distributions of signal across all arrays
Equivalent to having identical amounts of source material (mRNA)
Reduces technical noise across dataset \(\implies\) more statistical power

Background Correction

Background Correction performed simultaneously with estimation of signal
Robust Multichip Average (RMA) (Irizarry et al. 2003)
- Estimates signal for each array (\(\mu_i\))
- Model includes probe affinities (\(\alpha_j\))
- Doesn’t include MM probes
- Fitted using robust statistics to reduce impact of outlier probes

\[ \log_2 PM_{ij} = \mu_i + \alpha_j + \epsilon_{ij} \]

Extended to GC-RMA (Wu et al. 2004) to include GC content of probes
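
RMA fits the additive model above for each probeset using Tukey's median polish. The sketch below shows median polish only, on invented noise-free values for a single probeset; it is not the full RMA pipeline (no background correction or quantile normalisation), and the function and variable names are hypothetical.

```python
from statistics import median

def median_polish(log_pm, n_iter=10):
    """log_pm[j][i]: log2 PM signal for probe j (row) on array i (column)."""
    resid = [row[:] for row in log_pm]
    n_rows, n_cols = len(resid), len(resid[0])
    probe_eff, array_eff, overall = [0.0] * n_rows, [0.0] * n_cols, 0.0
    for _ in range(n_iter):
        # Sweep the median out of each row (probe affinity)
        for j in range(n_rows):
            med = median(resid[j])
            probe_eff[j] += med
            resid[j] = [x - med for x in resid[j]]
        shift = median(array_eff)
        overall += shift
        array_eff = [a - shift for a in array_eff]
        # Sweep the median out of each column (array-level signal)
        for i in range(n_cols):
            med = median([resid[j][i] for j in range(n_rows)])
            array_eff[i] += med
            for j in range(n_rows):
                resid[j][i] -= med
        shift = median(probe_eff)
        overall += shift
        probe_eff = [p - shift for p in probe_eff]
    return overall, array_eff, probe_eff, resid

# Hypothetical probeset: 3 probes x 3 arrays, built as mu_i + alpha_j
mu_true, alpha_true = [8.0, 10.0, 9.0], [-1.0, 0.0, 1.0]
log_pm = [[mu + alpha for mu in mu_true] for alpha in alpha_true]
overall, array_eff, probe_eff, resid = median_polish(log_pm)
mu_hat = [overall + a for a in array_eff]   # expression estimate per array
print(mu_hat)
```

Using medians rather than means is what makes the fit robust: a single aberrant probe has little influence on the array-level estimate once a probeset contains enough probes.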

Differential Expression Analysis

A primary challenge is to detect where gene expression levels change in response to biological question
- Often control samples Vs treated samples
Microarray data is normally distributed on the log₂ scale
- Can fit standard regression models

\[ H_0: \text{No difference in average gene expression levels}\\ H_A: \text{Some difference in average gene expression levels} \]

NB: Experiments estimate the true expression level across a theoretical population

Differential Expression Analysis

\[ T = \frac{\beta}{\sigma/\sqrt{n}} \]

\(\beta\) is the estimate of effect size (i.e. logFC)

Some estimates for \(\sigma\) are too low, others too high
- Too low \(\implies T \uparrow\) \(\implies\) significant result where no change
- Too high \(\implies T\downarrow\) \(\implies\) no significant result where there is change

Variance estimates are moderated by borrowing information from the distribution of \(\sigma\) across all genes
- Bayesian posterior estimate of variance \(\implies\) moderated t-statistic (Smyth 2004)
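
A rough sketch of the moderation step is given below, using the simplified \(T\) formula from this slide. The posterior variance is a weighted average of the gene-wise variance and a prior variance, weighted by their degrees of freedom. In limma the prior variance and prior degrees of freedom are estimated from all genes; here they are simply assumed values, as are all the numbers and function names.

```python
import math

def ordinary_t(beta, s2, n):
    """T = beta / (sigma / sqrt(n)), using the gene-wise variance s2."""
    return beta / math.sqrt(s2 / n)

def moderated_t(beta, s2, n, df, prior_s2, prior_df):
    """Same statistic, but with the variance shrunk towards a prior value."""
    post_s2 = (prior_df * prior_s2 + df * s2) / (prior_df + df)
    return beta / math.sqrt(post_s2 / n)

# Assumed prior (would normally be estimated across all genes)
prior_s2, prior_df = 0.04, 4
n, df = 4, 6

# Gene with a variance estimate that is too low: T is pulled down
low = (ordinary_t(0.5, 0.0004, n), moderated_t(0.5, 0.0004, n, df, prior_s2, prior_df))
# Gene with a variance estimate that is too high: T is pulled up
high = (ordinary_t(0.5, 0.4, n), moderated_t(0.5, 0.4, n, df, prior_s2, prior_df))
print(low, high)
```

The moderated statistic also gains degrees of freedom (prior plus residual), which is part of why the approach helps with small sample sizes.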

Differential Expression Analysis

The Bioconductor package limma is the industry standard (Smyth 2004)
- Still heavily used for modern RNA-Seq data
- Models tailored to managing variances found in transcriptomic datasets

After testing \(\rightarrow\) \(p\)-value for each gene
Multiple testing becomes an issue (revise Steven Delean’s lecture)
- Mostly use the Benjamini-Hochberg FDR
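
The Benjamini-Hochberg adjustment is simple enough to write out by hand. The sketch below is an illustrative implementation on invented \(p\)-values, with a hypothetical function name; in practice an established library routine would normally be used.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Work down from the largest p-value so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p = [0.001, 0.008, 0.039, 0.041, 0.60]   # hypothetical raw p-values
adj = benjamini_hochberg(p)
for raw, fdr in zip(p, adj):
    print(f"p = {raw:.3f} -> adjusted = {fdr:.5f}")
```

Each raw \(p\)-value is scaled by (number of tests / rank), so the smallest \(p\)-values are penalised most heavily, and genes with adjusted values below the chosen threshold form a list with a controlled expected proportion of false discoveries.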

Example Results

         logFC AveExpr     t  P.Value adj.P.Val     B
Gene 1   2.180  1.0834  7.96 4.78e-05   0.00478  2.52
Gene 34  1.361  0.1829  3.12 1.44e-02   0.53387 -3.47
Gene 2   1.980  1.3416  2.90 2.02e-02   0.53387 -3.82
Gene 72  1.127 -0.0684  2.86 2.15e-02   0.53387 -3.88
Gene 75 -0.524 -0.1168 -2.72 2.67e-02   0.53387 -4.10
Gene 86  0.771 -0.1801  2.52 3.60e-02   0.59375 -4.41
Gene 16 -0.501 -0.0495 -2.43 4.16e-02   0.59375 -4.55
Gene 83  0.453 -0.1065  2.27 5.35e-02   0.65833 -4.80
Gene 33 -0.932 -0.5752 -2.15 6.39e-02   0.65833 -4.97
Gene 89  0.467 -0.1272  2.13 6.58e-02   0.65833 -5.00

Taken from the limma help page examples

Volcano Plots

Shows estimated logFC against significance (-log₁₀\(p\))
Actually taken from RNA-Seq but plots are identical

Enrichment Testing

Often want to identify key pathways associated with results
- Are genes from specific pathways or gene-sets most impacted

Gene Ontology (GO) Terms:
- Carefully constructed, hierarchical database of terms
- Biological Process, Molecular Function and Cellular Component

Enrichment Testing

Oxidative Phosphorylation pathway as annotated in the KEGG database

Kyoto Encyclopedia of Genes and Genomes (KEGG):
- Molecular pathways with complete topology

Enrichment Testing

Often use approaches based on Fisher’s Exact Test
- Compare the proportion of DE genes within a pathway to the proportion of DE genes outside the pathway
Pathways often share DE genes \(\implies\) often need biological expertise
Multiple Testing across 1000s of pathways

20% of pathway is DE vs 6.6% of “not-pathway” genes

                 DE Genes  Not DE Genes   %DE
In Pathway             10            40   20%
Not In Pathway        990         14000  6.6%
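
The counts in the table can be tested with a one-sided Fisher's exact test, which is equivalent to a hypergeometric tail probability. The sketch below uses only the standard library; the function name is hypothetical and the one-sided (over-representation) alternative is an assumption about how the test would be applied here.

```python
from math import comb

def fisher_enrichment_p(de_in, not_de_in, de_out, not_de_out):
    """One-sided Fisher's exact test: P(at least de_in DE genes in the pathway)."""
    n_pathway = de_in + not_de_in
    n_de = de_in + de_out
    n_total = n_pathway + de_out + not_de_out
    upper = min(n_pathway, n_de)
    # Sum hypergeometric probabilities for tables at least as extreme
    numerator = sum(comb(n_de, k) * comb(n_total - n_de, n_pathway - k)
                    for k in range(de_in, upper + 1))
    return numerator / comb(n_total, n_pathway)

# Counts taken from the table above
p_value = fisher_enrichment_p(de_in=10, not_de_in=40, de_out=990, not_de_out=14000)
pct_in = 100 * 10 / (10 + 40)
pct_out = 100 * 990 / (990 + 14000)
print(f"{pct_in:.1f}% vs {pct_out:.1f}% DE, p = {p_value:.2g}")
```

This is a single pathway; across thousands of pathways the resulting \(p\)-values would still need adjusting for multiple testing.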

Enrichment Testing

An alternative approach is to use ranked lists
- Gene Set Enrichment Analysis (GSEA)
No requirement to classify genes as DE or not DE

Walk along the ranked list of genes, increasing the score if a gene is in the gene-set and decreasing it otherwise
- In the example, begin at the right
- Largest deviation from zero is -ve
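
The walk along the ranked list can be sketched as a simple running sum. This is a simplified, unweighted version (equal step sizes, no weighting by the ranking statistic and no permutation-based significance), so it illustrates the idea rather than reproducing GSEA itself. Gene names and the function name are invented, and the gene-set is assumed to contain at least one, but not all, of the ranked genes.

```python
def running_enrichment_score(ranked_genes, gene_set):
    """Walk the ranked list, stepping up for set members and down otherwise."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    running, total = [], 0.0
    for is_hit in hits:
        total += 1 / n_hit if is_hit else -1 / n_miss
        running.append(total)
    # Enrichment score: the largest deviation from zero, keeping its sign
    score = max(running, key=abs)
    return score, running

# Hypothetical ranking from most up-regulated to most down-regulated
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
gene_set = {"g6", "g7", "g8"}
score, running = running_enrichment_score(ranked, gene_set)
print(round(score, 3), [round(x, 3) for x in running])
```

Here the gene-set sits at the down-regulated end of the list, so the largest deviation is negative, and the running sum returns to zero at the end of the walk.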

Image taken from clusterProfiler vignette https://yulab-smu.github.io/clusterProfiler-book/chapter12.html

Closing Summary

Closing Comments

Microarray signal estimates follow a normal distribution (\(\mathcal{N}(\mu, \sigma)\)) when log-transformed
We can apply linear regression models
- In a simple A vs B experiment \(\equiv\) \(T\)-tests
Estimates of change from DE analysis often referred to as logFC (log fold-change)
Variances can be moderated for improved performance
\(p\)-values are usually FDR-adjusted
- Gives best compromise of power vs error-rate control

Closing Comments

Foundations built during the microarray era enabled analysis of RNA-Seq data
- R/Bioconductor Community
Core principles and methods developed during this era still apply
- Normalisation, DE analysis, multiple testing, enrichment testing etc
Many bioinformaticians from the microarray era are still very active
A lot of development occurred in Australia (e.g. Prof Gordon Smyth, WEHI)
- Next generation have been trained & mentored at WEHI, USyd etc
RNA-Seq data is not normally distributed \(\implies\) discrete counts not continuous fluorescence

References

Abugessaisa, Imad, Jordan A Ramilowski, Marina Lizio, Jessica Severin, Akira Hasegawa, Jayson Harshbarger, Atsushi Kondo, et al. 2020. “FANTOM Enters 20th Year: Expansion of Transcriptomic Atlases and Functional Annotation of Non-Coding RNAs.” Nucleic Acids Research 49 (D1): D892–98. https://doi.org/10.1093/nar/gkaa1054.

Adams, Mark D., Jenny M. Kelley, Jeannine D. Gocayne, Mark Dubnick, Mihael H. Polymeropoulos, Hong Xiao, Carl R. Merril, et al. 1991. “Complementary DNA Sequencing: Expressed Sequence Tags and Human Genome Project.” Science 252 (5013): 1651–56. http://www.jstor.org/stable/2876333.

Alwine, J. C., D. J. Kemp, and G. R. Stark. 1977. “Method for detection of specific RNAs in agarose gels by transfer to diazobenzyloxymethyl-paper and hybridization with DNA probes.” Proc. Natl. Acad. Sci. U.S.A. 74 (12): 5350–54.

Bolstad, B. M., R. A. Irizarry, M. Astrand, and T. P. Speed. 2003. “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias.” Bioinformatics 19 (2): 185–93.

Dudoit, Sandrine, Yee Hwa Yang, Matthew J. Callow, and Terence P. Speed. 2002. “Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments.” Statistica Sinica 12 (1): 111–39. http://www.jstor.org/stable/24307038.

Fang, Rui, Walter N Moss, Michael Rutenberg-Schoenberg, and Matthew D Simon. 2015. “Probing Xist RNA Structure in Cells Using Targeted Structure-Seq.” PLoS Genet. 11 (12): e1005668.

Irizarry, Rafael A., Bridget Hobbs, Francois Collin, Yasmin D. Beazer‐Barclay, Kristen J. Antonellis, Uwe Scherf, and Terence P. Speed. 2003. “Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data.” Biostatistics 4 (2): 249–64. https://doi.org/10.1093/biostatistics/4.2.249.

Shafee, Thomas, and Rohan Lowe. 2017. “Eukaryotic and Prokaryotic Gene Structure.” WikiJournal of Medicine, January. https://doi.org/10.15347/WJM/2017.002.

Shalon, D, S J Smith, and P O Brown. 1996. “A DNA Microarray System for Analyzing Complex DNA Samples Using Two-Color Fluorescent Probe Hybridization.” Genome Research 6 (7): 639–45. https://doi.org/10.1101/gr.6.7.639.

Smyth, G. K. 2004. “Linear models and empirical bayes methods for assessing differential expression in microarray experiments.” Stat Appl Genet Mol Biol 3: Article3.

Velculescu, V. E., L. Zhang, B. Vogelstein, and K. W. Kinzler. 1995. “Serial analysis of gene expression.” Science 270 (5235): 484–87.

Wang, Zhong, Mark Gerstein, and Michael Snyder. 2009. “RNA-Seq: A Revolutionary Tool for Transcriptomics.” Nat. Rev. Genet. 10 (1): 57–63.

Wu, Zhijin, Rafael A Irizarry, Robert Gentleman, Francisco Martinez-Murillo, and Forrest Spencer. 2004. “A Model-Based Background Adjustment for Oligonucleotide Expression Arrays.” Journal of the American Statistical Association 99 (468): 909–17.

Footnotes

¹ https://openoregon.pressbooks.pub/mhccbiology102/chapter/transcription/
² https://www.gencodegenes.org/
³ Images taken from: Bolstad, Probe Level Quantile Normalization for High Density Oligonucleotide Array Data, Unpublished Manuscript, 2001