- Hypothesis Testing
- Multiple Testing Considerations
29th July 2024
In biological research we often ask:
Perhaps more correctly, we are asking:
We might be comparing outcomes among treatment groups:
How do we decide if our experimental results are “important”?
We are asking whether the outcomes could occur by chance alone, if there really were no differences between the groups.
Every experiment is considered as a random sample from all possible repeated experiments.
Most experiments involve measuring something:
Many data collections can also be considered as experimental data sets
In the 100,000 Genomes Project a risk allele for T1D has a frequency of \(\pi = 0.07\) in European populations.
In our in vitro experiment, we found that 90% of HeLa cells were lysed by exposure to our drug.
All population parameters are considered to be fixed values, e.g. the allele frequency \(\pi\) or the population mean \(\mu\)
All classical statistical testing involves:
Why do we do this?
An experimental hypothesis may be
\[ H_0: \mu = 0 \quad \text{vs} \quad H_A: \mu \neq 0 \]
Where \(\mu\) represents the true average difference in a value (e.g. mRNA expression levels)
For every experiment we conduct we can get two key values:
1: The sample mean (\(\bar{x}\)) estimates the population-level mean (e.g. \(\mu\))
\[ \text{For} \quad \mathbf{x} = (x_1, x_2, ..., x_n) \] \[ \bar{x} = \frac{1}{n}\sum_{i = 1}^n x_i \]
This will be a different value every time we repeat the experiment; it is our estimate of the true effect.
2: The sample variance (\(s^2\)) estimates the population-level variance (\(\sigma^2\))
\[ s^2 = \frac{1}{n-1} \sum_{i = 1}^n (x_i - \bar{x})^2 \]
This will also be a different value every time we repeat the experiment.
Suppose we measure \(\Delta \Delta C_t\) values for FOXP3 by qPCR, where \(\mu\) is the average difference in FOXP3 expression in the entire population.
From our four values (2.1, 2.8, 2.5 and 2.6) we can get the sample mean:
\[ \begin{aligned} \bar{x} &= \frac{1}{n}\sum_{i = 1}^n x_i \\ &= \frac{1}{4}(2.1 + 2.8 + 2.5 + 2.6) \\ &= 2.5 \end{aligned} \]
This is our estimate of the true mean difference in expression (\(\mu\))
And the sample variance:
\[ \begin{aligned} s^2 &= \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2\\ & = \frac{1}{3}\sum_{i = 1}^4 (x_i - 2.5)^2\\ &= 0.0867 \end{aligned} \]
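As a quick check, both values can be reproduced in base R:

```r
## The four Delta-Delta-Ct values from the worked example
x <- c(2.1, 2.8, 2.5, 2.6)
mean(x)   # 2.5
var(x)    # 0.08666667: var() uses the n - 1 denominator by default
```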
\[ \mathbf{\bar{x}} = \{\bar{x}_1, \bar{x}_2, \dots, \bar{x}_m \} \]
This represents a theoretical set of repeated experiments with a different sample mean for each.
We usually just have one experiment (\(\bar{x}\)).
\[ \mathbf{\bar{x}} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]
where \(\mu\) is the population mean, \(\sigma\) is the population standard deviation and \(n\) is the sample size.
We know what our experimental results (\(\bar{x}\)) will look like.
\[ \bar{x} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]
If we subtract the population mean:
\[ \bar{x} - \mu \sim \mathcal{N}(0, \frac{\sigma}{\sqrt{n}}) \]
NB: We almost always test for no effect \(H_0: \mu = 0\)
\[ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1) \]
\[ H_0: \mu = 0 \quad \text{vs} \quad H_A: \mu \neq 0 \]
If \(H_0\) is true, where would we expect \(Z\) to be?
If \(H_0\) is NOT true, where would we expect \(Z\) to be?
Would a value \(Z > 1\) be likely if \(H_0\) is TRUE?
Would a value \(Z > 2\) be likely if \(H_0\) is TRUE?
In our qPCR experiment, could the \(\Delta \Delta C_t\) values be either side of zero?
\[P(|Z| > 2) = P(Z > 2) + P(Z < -2)\]
This is the most common way of determining how much evidence we have against \(H_0\)
\[ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]
A \(p\) value is the probability of observing data as extreme, or more extreme than we have observed, if \(H_0\) is true.
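For example, the two-sided probability \(P(|Z| > 2)\) from above is easily obtained in R from the standard normal CDF:

```r
## Two-sided tail probability P(|Z| > 2) under N(0, 1)
pnorm(2, lower.tail = FALSE) + pnorm(-2)   # ~0.0455
2 * pnorm(-2)                              # equivalent, by symmetry
```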
To summarise:
In reality, we will never know the population variance (\(\sigma^2\)), just like we will never know \(\mu\)
Due to the uncertainty introduced by using \(s^2\) instead of \(\sigma^2\) we can no longer compare to the \(Z \sim \mathcal{N}(0, 1)\) distribution.
Instead we use a \(T\)-statistic
\[ T = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]
Then we compare to a \(t\)-distribution
The \(t\)-distribution is very similar to \(\mathcal{N}(0, 1)\)
At their simplest, the degrees of freedom are:
\[ \text{df} = \nu = n - 1 \]
As \(n\) increases, \(s^2 \rightarrow \sigma^2\) and therefore \(t_{\nu} \rightarrow Z\)
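This convergence can be seen by comparing the quantiles that cut off the upper 2.5% of each distribution:

```r
## t quantiles shrink towards the Normal value as df increases
qt(0.975, df = c(3, 10, 30, 100))   # 3.18  2.23  2.04  1.98
qnorm(0.975)                        # 1.96
```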
For our qPCR data, testing \(\mu = 0\):
\[ \begin{aligned} T &= \frac{\bar{x} - \mu}{s / \sqrt{n}} \\ &= \frac{2.5 - 0}{0.2944 / \sqrt{4}} \\ &= 16.984 \end{aligned} \]
\[p = 0.00044\]
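The same test in R, using the four \(\Delta \Delta C_t\) values from above:

```r
## One-sample t-test of H0: mu = 0 on the four ddCt values
ddCt <- c(2.1, 2.8, 2.5, 2.6)
t.test(ddCt, mu = 0)   # t = 16.984, df = 3, p-value ~ 0.00044
```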
In the above we had the \(\Delta \Delta C_t\) values within each donor.
What if we just had 4 values from each cell-type from different donors?
For \(H_0: \mu_A = \mu_B\) vs \(H_A: \mu_A \neq \mu_B\)
\[ T = \frac{\bar{x}_A - \bar{x}_B}{\text{SE}_{\bar{x}_A - \bar{x}_B}} \]
If \(H_0\) is true then
\[ T \sim t_{\nu} \]
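A minimal sketch in R; note that the group values here are invented purely for illustration:

```r
## Two-sample t-test of H0: mu_A = mu_B
## NB: made-up example values, not real data
A <- c(3.1, 2.9, 3.4, 3.0)
B <- c(2.5, 2.2, 2.8, 2.4)
t.test(A, B)   # Welch's test by default; does not assume equal variances
```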
When would data not be Normally Distributed?
Two useful tests:
\(H_0: \mu_A = \mu_B\) vs \(H_A: \mu_A \neq \mu_B\)
An example table of allele counts by location:

|              | A  | B  |
|--------------|----|----|
| Upper Lakes  | 12 | 12 |
| Lower Plains | 20 | 4  |
\(H_0:\) No association between allele frequencies and location
\(H_A:\) There is an association between allele frequencies and location
```
## Reject null hypothesis of no association between
## allele frequencies and location
## P-value = 0.0305
```
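This output is consistent with Fisher's Exact Test applied to the table above; a minimal sketch in R, assuming that test was used:

```r
## Allele counts by location; matrix() fills column-by-column
alleles <- matrix(
  c(12, 20, 12, 4), nrow = 2,
  dimnames = list(Location = c("Upper Lakes", "Lower Plains"),
                  Allele = c("A", "B"))
)
fisher.test(alleles)   # p-value = 0.0305
```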
A \(p\) value is the probability of observing data as (or more) extreme if \(H_0\) is true.
We commonly reject \(H_0\) if \(p < 0.05\)
How often would we incorrectly reject \(H_0\)?
About 1 in 20 times, we will see \(p < 0.05\) if \(H_0\) is true
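A small simulation in R illustrates this; every dataset below is sampled with a true mean of zero, so \(H_0\) is always true:

```r
## Simulate 10,000 experiments in which H0 is genuinely true
set.seed(100)
p <- replicate(10000, t.test(rnorm(10))$p.value)
mean(p < 0.05)   # ~0.05, i.e. about 1 in 20
```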
|                       | \(H_0\) TRUE   | \(H_0\) FALSE  |
|-----------------------|----------------|----------------|
| Reject \(H_0\)        | Type I Error   | \(\checkmark\) |
| Don’t Reject \(H_0\)  | \(\checkmark\) | Type II Error  |
What are the consequences of each type of error?
Type I: Waste $$$ chasing dead ends
Type II: We miss a key discovery
How many times would we incorrectly reject \(H_0\) using \(p < 0.05\)?
We effectively have 25,000 tests, with \(H_0\) true for 24,000 of them
\(\frac{25000 - 1000}{20} = 1200\) times
Could this lead to any research dead-ends?
This is an example of the Family-Wise Error Rate (i.e. Experiment-Wise Error Rate)
The Family-Wise Error Rate (FWER) is the probability of making one or more false rejections of \(H_0\)
In our example, the FWER \(\approx 1\)
What about if we lowered the rejection value to \(\alpha = 0.001\)?
We would incorrectly reject \(H_0\) once in every 1,000 times
\(\frac{25000 - 1000}{1000} = 24\) times
The FWER is still \(\approx 1\)
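Under the simplifying assumption that the tests are independent, the FWER for \(m\) true-null tests is \(1 - (1 - \alpha)^m\), which explains both results:

```r
## FWER = P(at least one false rejection) across m independent true nulls
fwer <- function(alpha, m) 1 - (1 - alpha)^m
fwer(0.05, 24000)    # indistinguishable from 1
fwer(0.001, 24000)   # still effectively 1
```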
What are the consequences of this?
The most common procedure for controlling the False Discovery Rate (FDR) is the Benjamini-Hochberg procedure
What advantage would this offer?
For those interested, the BH procedure for \(m\) tests is (not examinable):

1. Sort the \(p\) values in ascending order: \(p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}\)
2. Find the largest rank \(k\) such that \(p_{(k)} \leq \frac{k}{m}\alpha\)
3. Reject \(H_0\) for the \(k\) tests with the smallest \(p\) values
Controlling the FDR at \(\alpha = 0.05\)
| Rank (\(i\)) | \(p\) value | BH threshold \(\frac{i}{m}\alpha\) | Reject \(H_0\) |
|---|---|---|---|
| 1  | 0.0001 | 0.005 | 1 |
| 2  | 0.0008 | 0.010 | 1 |
| 3  | 0.0021 | 0.015 | 1 |
| 4  | 0.0234 | 0.020 | 0 |
| 5  | 0.0293 | 0.025 | 0 |
| 6  | 0.0500 | 0.030 | 0 |
| 7  | 0.3354 | 0.035 | 0 |
| 8  | 0.5211 | 0.040 | 0 |
| 9  | 0.9123 | 0.045 | 0 |
| 10 | 1.0000 | 0.050 | 0 |
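These decisions can be reproduced with R's `p.adjust()`, which returns BH-adjusted \(p\) values for direct comparison with \(\alpha\):

```r
## BH-adjusted p-values for the ten example tests
p <- c(0.0001, 0.0008, 0.0021, 0.0234, 0.0293,
       0.0500, 0.3354, 0.5211, 0.9123, 1.0000)
p.adjust(p, method = "BH") < 0.05   # TRUE for the first three tests only
```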