- Hypothesis Testing
- Multiple Testing Considerations
29th July 2024
In biological research we often ask:
Perhaps more correctly, we are asking:
We might be comparing outcomes among treatment groups:
How do we decide if our experimental results are “important”?
We are asking whether the outcomes could occur by chance alone, if there really were no differences between the groups.
Every experiment is considered as a random sample from all possible repeated experiments.
Most experiments involve measuring something:
Many data collections can also be considered as experimental data sets
In the 100,000 Genomes Project a risk allele for T1D has a frequency of \(\pi = 0.07\) in European populations.
In our in vitro experiment, we found that 90% of HeLa cells were lysed by exposure to our drug.
All population parameters are considered to be fixed values, e.g. the allele frequency \(\pi\) or the population mean \(\mu\)
All classical statistical testing involves:
Why do we do this?
An experimental hypothesis may be
\[ H_0: \mu = 0 \quad \text{vs} \quad H_A: \mu \neq 0 \]
Where \(\mu\) represents the true average difference in a value (e.g. mRNA expression levels)
For every experiment we conduct we can get two key values:
1: The sample mean (\(\bar{x}\)) estimates the population-level mean (e.g. \(\mu\))
\[ \text{For} \quad \mathbf{x} = (x_1, x_2, ..., x_n) \] \[ \bar{x} = \frac{1}{n}\sum_{i = 1}^n x_i \]
This will be a different value every time we repeat the experiment; it is our estimate of the true effect.
2: The sample variance (\(s^2\)) estimates the population-level variance (\(\sigma^2\))
\[ s^2 = \frac{1}{n-1} \sum_{i = 1}^n (x_i - \bar{x})^2 \]
This will also be a different value every time we repeat the experiment.
Suppose we measure \(\Delta \Delta C_t\) values for FOXP3 by qPCR, where \(\mu\) is the average difference in FOXP3 expression in the entire population.
From our four values (2.1, 2.8, 2.5 and 2.6) we can get the sample mean:
\[ \begin{aligned} \bar{x} &= \frac{1}{n}\sum_{i = 1}^n x_i \\ &= \frac{1}{4}(2.1 + 2.8 + 2.5 + 2.6) \\ &= 2.5 \end{aligned} \]
This is our estimate of the true mean difference in expression (\(\mu\))
And the sample variance:
\[ \begin{aligned} s^2 &= \frac{1}{n - 1} \sum_{i = 1}^n (x_i - \bar{x})^2\\ & = \frac{1}{3}\sum_{i = 1}^4 (x_i - 2.5)^2\\ &= 0.0867 \end{aligned} \]
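As a quick check, both values can be reproduced in base R:

```r
## The four Delta-Delta-Ct values from the worked example
x <- c(2.1, 2.8, 2.5, 2.6)
mean(x)   # 2.5
var(x)    # 0.08666667: var() uses the n - 1 denominator by default
```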
\[ \mathbf{\bar{x}} = \{\bar{x}_1, \bar{x}_2, \dots, \bar{x}_m \} \]
This represents a theoretical set of repeated experiments with a different sample mean for each.
We usually just have one experiment (\(\bar{x}\)).
\[ \mathbf{\bar{x}} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]
where \(\mu\) is the population mean, \(\sigma\) is the population standard deviation and \(n\) is the sample size.
We know what our experimental results (\(\bar{x}\)) will look like.
\[ \bar{x} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]
If we subtract the population mean:
\[ \bar{x} - \mu \sim \mathcal{N}(0, \frac{\sigma}{\sqrt{n}}) \]
NB: We almost always test for no effect \(H_0: \mu = 0\)
\[ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1) \]
\[ H_0: \mu = 0 \quad \text{vs} \quad H_A: \mu \neq 0 \]
If \(H_0\) is true, where would we expect \(Z\) to be?
If \(H_0\) is NOT true, where would we expect \(Z\) to be?
Would a value \(Z > 1\) be likely if \(H_0\) is TRUE?
Would a value \(Z > 2\) be likely if \(H_0\) is TRUE?
In our qPCR experiment, could the \(\Delta \Delta C_t\) values be either side of zero?
\[P(|Z| > 2) = P(Z > 2) + P(Z < -2)\]
This is the most common way of determining how much evidence we have against \(H_0\)
\[ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]
A \(p\) value is the probability of observing data as extreme, or more extreme than we have observed, if \(H_0\) is true.
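For example, the two-sided probability \(P(|Z| > 2)\) from above is easily obtained in R from the standard normal CDF:

```r
## Two-sided tail probability P(|Z| > 2) under N(0, 1)
pnorm(2, lower.tail = FALSE) + pnorm(-2)   # ~0.0455
2 * pnorm(-2)                              # equivalent, by symmetry
```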
To summarise:
In reality, we will never know the population variance (\(\sigma^2\)), just like we will never know \(\mu\)
Due to the uncertainty introduced by using \(s^2\) instead of \(\sigma^2\) we can no longer compare to the \(Z \sim \mathcal{N}(0, 1)\) distribution.
Instead we use a \(T\)-statistic
\[ T = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]
Then we compare to a \(t\)-distribution
The \(t\)-distribution is very similar to \(\mathcal{N}(0, 1)\)
At their simplest, the degrees of freedom are:
\[ \text{df} = \nu = n - 1 \]
As \(n\) increases, \(s^2 \rightarrow \sigma^2\) and therefore \(t_{\nu} \rightarrow Z\)
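This convergence can be seen by comparing the quantiles that cut off the upper 2.5% of each distribution:

```r
## t quantiles shrink towards the Normal value as df increases
qt(0.975, df = c(3, 10, 30, 100))   # 3.18  2.23  2.04  1.98
qnorm(0.975)                        # 1.96
```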
For our qPCR data, testing \(\mu = 0\):
\[ \begin{aligned} T &= \frac{\bar{x} - \mu}{s / \sqrt{n}} \\ &= \frac{2.5 - 0}{0.2944 / \sqrt{4}} \\ &= 16.984 \end{aligned} \]
\[p = 0.00044\]
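The same test in R, using the four \(\Delta \Delta C_t\) values from above:

```r
## One-sample t-test of H0: mu = 0 on the four ddCt values
ddCt <- c(2.1, 2.8, 2.5, 2.6)
t.test(ddCt, mu = 0)   # t = 16.984, df = 3, p-value ~ 0.00044
```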
In the above we had the \(\Delta \Delta C_t\) values within each donor.
What if we just had 4 values from each cell-type from different donors?
For \(H_0: \mu_A = \mu_B\) vs \(H_A: \mu_A \neq \mu_B\)
\[ T = \frac{\bar{x}_A - \bar{x}_B}{\text{SE}_{\bar{x}_A - \bar{x}_B}} \]
If \(H_0\) is true then
\[ T \sim t_{\nu} \]
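A minimal sketch in R; note that the group values here are invented purely for illustration:

```r
## Two-sample t-test of H0: mu_A = mu_B
## NB: made-up example values, not real data
A <- c(3.1, 2.9, 3.4, 3.0)
B <- c(2.5, 2.2, 2.8, 2.4)
t.test(A, B)   # Welch's test by default; does not assume equal variances
```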
When would data not be Normally Distributed?
Two useful tests:
\(H_0: \mu_A = \mu_B\) vs \(H_A: \mu_A \neq \mu_B\)
An example table of allele counts by location:

|              | A  | B  |
|--------------|----|----|
| Upper Lakes  | 12 | 12 |
| Lower Plains | 20 | 4  |
\(H_0:\) No association between allele frequencies and location
\(H_A:\) There is an association between allele frequencies and location
```
## Reject null hypothesis of no association between
## allele frequencies and location
## P-value = 0.0305
```
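This output is consistent with Fisher's Exact Test applied to the table above; a minimal sketch in R, assuming that test was used:

```r
## Allele counts by location; matrix() fills column-by-column
alleles <- matrix(
  c(12, 20, 12, 4), nrow = 2,
  dimnames = list(Location = c("Upper Lakes", "Lower Plains"),
                  Allele = c("A", "B"))
)
fisher.test(alleles)   # p-value = 0.0305
```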
A \(p\) value is the probability of observing data as (or more) extreme if \(H_0\) is true.
We commonly reject \(H_0\) if \(p < 0.05\)
How often would we incorrectly reject \(H_0\)?
About 1 in 20 times, we will see \(p < 0.05\) if \(H_0\) is true
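A small simulation in R illustrates this; every dataset below is sampled with a true mean of zero, so \(H_0\) is always true:

```r
## Simulate 10,000 experiments in which H0 is genuinely true
set.seed(100)
p <- replicate(10000, t.test(rnorm(10))$p.value)
mean(p < 0.05)   # ~0.05, i.e. about 1 in 20
```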
|                       | \(H_0\) TRUE   | \(H_0\) FALSE  |
|-----------------------|----------------|----------------|
| Reject \(H_0\)        | Type I Error   | \(\checkmark\) |
| Don’t Reject \(H_0\)  | \(\checkmark\) | Type II Error  |
What are the consequences of each type of error?
Type I: Waste $$$ chasing dead ends
Type II: We miss a key discovery
How many times would we incorrectly reject \(H_0\) using \(p < 0.05\)?
We effectively have 25,000 tests, with \(H_0\) true for 24,000 of them
\(\frac{25000 - 1000}{20} = 1200\) times
Could this lead to any research dead-ends?
This is an example of the Family-Wise Error Rate (i.e. Experiment-Wise Error Rate)
The Family-Wise Error Rate (FWER) is the probability of making one or more false rejections of \(H_0\)
In our example, the FWER \(\approx 1\)
What about if we lowered the rejection value to \(\alpha = 0.001\)?
We would incorrectly reject \(H_0\) once in every 1,000 times
\(\frac{25000 - 1000}{1000} = 24\) times
The FWER is still \(\approx 1\)
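Under the simplifying assumption that the tests are independent, the FWER for \(m\) true-null tests is \(1 - (1 - \alpha)^m\), which explains both results:

```r
## FWER = P(at least one false rejection) across m independent true nulls
fwer <- function(alpha, m) 1 - (1 - alpha)^m
fwer(0.05, 24000)    # indistinguishable from 1
fwer(0.001, 24000)   # still effectively 1
```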
What are the consequences of this?
The most common procedure for controlling the False Discovery Rate (FDR) is the Benjamini-Hochberg procedure
What advantage would this offer?
For those interested, the BH procedure for \(m\) tests is (not examinable):

1. Sort the \(p\) values in ascending order: \(p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(m)}\)
2. Find the largest rank \(k\) such that \(p_{(k)} \leq \frac{k}{m}\alpha\)
3. Reject \(H_0\) for the \(k\) tests with the smallest \(p\) values
Controlling the FDR at \(\alpha = 0.05\)
| Rank (\(i\)) | \(p\) value | BH threshold \(\frac{i}{m}\alpha\) | Reject \(H_0\) |
|---|---|---|---|
| 1  | 0.0001 | 0.005 | 1 |
| 2  | 0.0008 | 0.010 | 1 |
| 3  | 0.0021 | 0.015 | 1 |
| 4  | 0.0234 | 0.020 | 0 |
| 5  | 0.0293 | 0.025 | 0 |
| 6  | 0.0500 | 0.030 | 0 |
| 7  | 0.3354 | 0.035 | 0 |
| 8  | 0.5211 | 0.040 | 0 |
| 9  | 0.9123 | 0.045 | 0 |
| 10 | 1.0000 | 0.050 | 0 |
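These decisions can be reproduced with R's `p.adjust()`, which returns BH-adjusted \(p\) values for direct comparison with \(\alpha\):

```r
## BH-adjusted p-values for the ten example tests
p <- c(0.0001, 0.0008, 0.0021, 0.0234, 0.0293,
       0.0500, 0.3354, 0.5211, 0.9123, 1.0000)
p.adjust(p, method = "BH") < 0.05   # TRUE for the first three tests only
```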