In the lectures we discussed:

- Every experiment is a random sample from the complete set of all possible samples
- We define a null hypothesis (\(H_0\)) so we can describe what our data will look like if there are no effects due to our experiment
- The sample mean (\(\bar{x}\)) is Normally Distributed around the population mean (\(\mu\))
- We can use our sample variance (\(s^2\)) to estimate the population variance (\(\sigma^2\))
- A \(T\)-statistic can be calculated using the formula \(T = \frac{\bar{x} - \mu}{s / \sqrt{n}}\)
- We compare our \(T\)-statistic to a \(t\)-distribution to see how likely more extreme values are if \(H_0\) is true, and reject \(H_0\) if these \(p\)-values are low.
In the following questions, try to describe the value that each experiment is investigating:
The average age of all students at the University of Adelaide was calculated. Is this an estimate? Explain your reasoning.
|          | Allele A | Allele B |
|----------|----------|----------|
| Hypoxia  | 10       | 32       |
| Normoxia | 94       | 86       |
What would the null and alternate hypotheses be for this experiment?
R allows us to easily take random samples from a normal distribution. We can generate 10 random samples from \(\mathcal{N}(5, 1)\) using the following code.
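A minimal chunk for this might look like the following, where samp is just a suggested name for the resulting vector:

samp <- rnorm(n = 10, mean = 5, sd = 1)   # 10 random values from N(5, 1)
samp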
We could then find the mean of our random sample using the following
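For example, continuing with the suggested samp object from above:

mean(samp)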
We could actually repeat this 1000 times, very quickly, in R. The following command will form a matrix with 1000 columns, and each column will be the 10 randomly sampled values.
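A sketch of such a command, using rnd as the matrix name so it matches the code further below:

rnd <- matrix(rnorm(10 * 1000, mean = 5, sd = 1), nrow = 10, ncol = 1000)
dim(rnd)   # 10 rows (values within each sample) and 1000 columns (one per sample)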
We could get the sample means from each column using the command colMeans() and plot a histogram of the means.
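For example, storing the means as xbar so the name matches the histogram code below:

xbar <- colMeans(rnd)   # the 1000 sample means
hist(xbar, breaks = 50)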
This is a histogram of the sample means from each of our 1000 random samples.
We know from lectures that each sample mean will itself be a random sample from \[ \bar{x} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]
In our example, we have set \(\mu = 5\), \(\sigma = 1\) and \(n = 10\). Let’s overlay this distribution to check.
x <- seq(3, 8, length.out = 100)             # x-axis values for the theoretical curve
y <- dnorm(x, mean = 5, sd = 1 / sqrt(10))   # density of the distribution of the sample mean
hist(xbar, breaks = 50, freq = FALSE)        # freq = FALSE puts the histogram on the density scale
lines(x, y)                                  # overlay the theoretical density
We can clearly see that each sample mean is truly a single random sample from this distribution.
In the above example, we knew the population standard deviation (\(\sigma = 1\)), and therefore the variance, because we set this value as an argument to the rnorm function in our random sampling step. In reality we don’t know this value, so we would use our sample variance (\(s^2\)) and a \(t\)-test: we calculate the \(t\)-statistic and compare this to the null distribution of the \(t\)-statistic.
Let’s have a look at how the sample variances are distributed.
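One way to do this, using the rnd matrix from above:

s2 <- apply(rnd, 2, var)     # sample variance of each column (each simulated experiment)
hist(s2, breaks = 50, xlab = "Sample variance", main = "")
abline(v = 1, col = "red")   # the true variance, sigma^2 = 1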
Notice that the sample variances (\(s^2\)) are not normally distributed around the true variance (\(\sigma^2 = 1\)).
For these random samples, we knew the true value for the population mean (\(\mu = 5\)). In reality, we wouldn’t know this and we’d need to conduct a hypothesis test with a null hypothesis and alternate hypothesis.
Let’s test the hypothesis using \(\mu = 5\) as the value of interest:
\[ H_0: \mu = 5 \quad \text{Vs} \quad H_A: \mu \neq 5 \]
To do this, we first calculate our \(T\)-statistic using: \[ T = \frac{\bar{x} - \mu}{s / \sqrt{n}} \] Let’s just do this on the first column to start with.
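One way to do this for the first column, storing the statistic as Tstat and the degrees of freedom as df, since both names are used in the plotting code below:

n <- 10
df <- n - 1   # degrees of freedom
Tstat <- (mean(rnd[, 1]) - 5) / (sd(rnd[, 1]) / sqrt(n))
Tstat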
NB: Here the degrees of freedom is n - 1, so we would check this against the distribution for \(t_9\). In my simulations I got the value \(T =\) -0.1799. What value did you get from the first column in your simulations?
Let’s visualise this by first plotting the density of a \(t_9\) distribution, then adding vertical lines for our obtained statistic. By my estimation, 1 in 20 of you should get a relatively extreme \(T\)-statistic. (Feel free to just copy & paste this code chunk, but do try to understand what each line is doing.)
library(tidyverse)
x <- seq(-4, 4, length.out = 101) # Create a vector for the x-axis
y <- dt(x, df) # Find the value of the density at each value of x
data <- tibble(x = x, y = y) # Combine x & y into a tibble - i.e., a data frame
ggplot(data, aes(x = x, y = y)) +
  geom_line() +
  geom_ribbon(
    data = filter(data, x > abs(Tstat)),
    aes(x = x, ymax = y, ymin = 0),
    fill = "grey"
  ) +
  geom_ribbon(
    data = filter(data, x < -abs(Tstat)),
    aes(x = x, ymax = y, ymin = 0),
    fill = "grey"
  ) +
  geom_vline(xintercept = c(-1, 1) * Tstat) +
  theme_bw() +
  labs(x = "T", y = "") +
  scale_y_continuous(expand = c(0, 0)) +
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.line.y = element_blank(),
    axis.line.x = element_line(colour = "black"),
    panel.border = element_blank(),
    panel.grid = element_blank()
  )
We also learned about \(p\)-values and multiple testing. We could calculate the \(T\)-statistic and \(p\)-value for every column in our matrix of random samples, where each column is a sample of 10 values from \(\mathcal{N}(5, 1)\)
library(matrixStats)   # provides the colSds() function
Tstat <- (colMeans(rnd) - 5) / (colSds(rnd) / sqrt(10))   # T-statistic for each column
pValues <- 2 * pt(abs(Tstat), df, lower.tail = FALSE)     # two-sided p-values, df = 9
In many analyses, we use \(p < 0.05\) to reject \(H_0\)
We could use the Bonferroni procedure to reduce our Type I errors, and control the Family Wise Error Rate (FWER)
In R we can simply adjust our \(p\)-values using the function p.adjust(). This will give identical results to the above, but instead each adjusted \(p\)-value will be \(\min(1, p \times m)\), where \(m = 1000\) and represents the number of tests we are performing.
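A sketch of this, continuing from the pValues calculated above:

sum(pValues < 0.05)                                     # unadjusted rejections at p < 0.05
adjustedP <- p.adjust(pValues, method = "bonferroni")
sum(adjustedP < 0.05)                                   # rejections after the Bonferroni adjustment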
An alternative we briefly mentioned in lectures would be to allow a small number of false discoveries in our set of results. This is known as the FDR and we can easily obtain this in R.
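For example, again using p.adjust() but with the FDR method:

fdrP <- p.adjust(pValues, method = "fdr")
sum(fdrP < 0.05)   # rejections after controlling the FDR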
We often use simulated data to demonstrate statistical concepts like the central limit theorem and multiple testing procedures. As you have seen, we used data simulated from a normal distribution in the Statistics Tutorial to demonstrate what would happen if we did an experiment many, many times and calculated the means and variances of those repeat experiments. We do this because it is difficult to generate test statistics from real data that contain a mixture of results from both null and alternative hypotheses, since we don’t know which hypotheses are truly null and which are alternative. Sometimes it can therefore be helpful to combine real and simulated data to demonstrate these concepts, and we will use that combined approach in the following example.
This example uses data on normalised chromatin accessibility across 68 human donors. Chromatin accessibility is a quantitative measure of how tightly DNA is packed at a given location, which indicates how accessible that DNA is for protein binding, in particular by regulatory proteins that might affect transcription of genes. The original data are available from this source: http://mitra.stanford.edu/kundaje/portal/chromovar3d/dnase.html. Large values correspond to more accessible chromatin, each column corresponds to a different human donor, and each row corresponds to a distinct location in the genome. The rows are actually peaks of accessibility that were detected in at least some subset of the samples.
For the purposes of this example, and in part because the original data are many hundreds of megabytes in size, the code below simulates \(p\)-values to match the results of an analysis of the data in which individual donors were grouped according to whether or not they had a particular disease. The differences in normalised chromatin accessibility between the two groups of donors were tested using a \(t\)-test at each of the 250,000 genomic loci in the data. The details of the following code are not important, so you can skip over this section. (For those interested, we are simulating \(p\)-values from 250,000 tests where the null hypothesis of no difference in chromatin accessibility is true, and then spiking these results with 10,000 tests where the alternative hypothesis is true, i.e. tests with small \(p\)-values.)
# Generate p-values under the null hypothesis of no difference
set.seed(3574)
tsPvals <- runif(250000, 0, 1)
# Spike in 10,000 tests where the null hypothesis will be rejected
# (i.e., inferring the alternative hypothesis)
tsPvals[1:10000] <- 10^runif(10000, -6.8, -1)
# Set Type 1 error rate (alpha)
alpha <- 0.05
# Histogram of the resulting p-values
brks <- 0:20 / 20
hist(tsPvals, col = "grey", breaks = brks, main = "Histogram of P-values", xlab = "P-values")
Recall that a Type 1 error is an incorrect decision that occurs when we conduct a statistical test and reject a null hypothesis that is actually true. The researcher can set the rate of such errors that is deemed acceptable for a particular experiment. Generally, 1 error in 20 tests is deemed acceptable, so the error rate is 5%, which makes the probability of a Type 1 error equal to 0.05; this value is usually denoted α.
Therefore, when we conduct a single hypothesis test the probability of making an error equals α (1/20 = 0.05) and the probability of not making an error equals 1-α (19/20 = 0.95).
If we want to do m = 250,000 hypothesis tests, as in our example data above, then if we use the standard α = 0.05 threshold as the Type 1 error rate for each test, we would expect to reject the null hypothesis in 12,500 (250,000 × 1/20) tests by chance alone (i.e., we would deem 12,500 loci significantly different between our treatment groups of donors by chance alone). We could say that the per-family error rate (PFER), which is the expected number of Type 1 Errors, is 12,500.
If we then define the Family-Wise Error Rate (FWER) as the probability of at least one Type 1 Error, we can calculate this probability across m tests as P(making at least 1 error in m tests) = 1 - (1 - α)^m. If we have one hypothesis test, this equals 1 - (1 - 0.05)^1 = 0.05. This probability will continue to increase nonlinearly as we conduct more and more tests, and will asymptote close to 1 by the time we get to 100 tests!
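We can check this in R for an increasing number of tests, using the alpha value set above:

m <- c(1, 10, 50, 100, 250000)
1 - (1 - alpha)^m   # the FWER rises rapidly towards 1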
Clearly, we have a multiple testing problem to address for our example data because we are performing 250,000 hypothesis tests: our FWER is essentially 1!
Let’s now ask: how many \(p\)-values out of our simulated \(m = 250000\) tests do we see below 0.05?
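For example:

sum(tsPvals < alpha)   # compare this count with the expected 12,500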
So this number is greater than the expected PFER of 12,500 by chance alone. How do we correct for multiple testing to see which results we should retain for further experimentation to make useful scientific discoveries?
Many procedures have been developed to control the FWER (the probability of at least one type I error), and these fall into two types of FWER corrections: a single step type where equal adjustments are made to all \(p\)-values and a sequential type where adaptive adjustments are made to each \(p\)-value.
One single step approach is the Bonferroni adjustment, which ensures the FWER equals α by adjusting α to equal α/m. In our example we would adjust α to equal 0.05/250000 = 0.0000002.
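We can count how many of our \(p\)-values fall below this adjusted threshold:

sum(tsPvals < alpha / 250000)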
In R we can simply adjust our \(p\)-values using the function p.adjust(). This will give identical results to the above manual adjustment of α, but instead each adjusted \(p\)-value will be \(\min(1, p \times m)\), where \(m = 250000\) and represents the number of tests we are performing.
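A sketch of this approach:

bonfPvals <- p.adjust(tsPvals, method = "bonferroni")
sum(bonfPvals < alpha)   # should match the count from the manual adjustment above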
So the Bonferroni adjustment is a severe adjustment and would result in only 170 null hypotheses being rejected and these loci being explored further. Among many criticisms of the approach are that it is counter-intuitive because the interpretation of findings depends on the number of other tests performed, that the general null hypothesis (that all the null hypotheses are true) is rarely of interest, and that it leads to a high probability of Type 2 Errors (i.e., of failing to reject the null hypothesis when important effects exist).
To address these concerns, one sequential approach is to control the False Discovery Rate (FDR), which is the expected proportion of Type I errors (false positives) among the rejected hypotheses. We want to control this FDR at α = 0.05. As described in the lecture, this approach is sequential because we first order the unadjusted \(p\)-values in ascending order (i.e., from smallest to largest). Then we work our way sequentially through the ordered list of \(p\)-values and evaluate whether each \(p\)-value is less than or equal to (j/m) × α, where j is the rank order, and we reject the null hypotheses in these cases. We can stop when we reach the first \(p\)-value in the sequence that is greater than (j/m) × α because all subsequent \(p\)-values will also fail to reject the null hypothesis.
The following code applies the False Discovery Rate (FDR) adjustment to our example results:
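A sketch, using p.adjust() with the FDR method:

fdrPvals <- p.adjust(tsPvals, method = "fdr")
sum(fdrPvals < alpha)   # the number of rejections after the FDR adjustment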
Here we see that the FDR adjustment results in 7169 rejections that we can pursue with further research. This result is therefore intermediate between the 21551 rejections we would make without accounting for multiple testing and the extremely conservative Bonferroni approach, where we would reject only 170 tests.
The FDR strikes a balance between making too many Type 1 Errors (wasting lots of money chasing dead ends) and making too many Type 2 Errors (missing lots of important discoveries).
All of the above examples refer to normally distributed data. Let’s say we have some pretty extreme values.
Fisher’s Exact Test can be used for comparing counts in a \(2 \times 2\) contingency table. In the hypoxia example above these values would be:
counts <- matrix(c(10, 94, 32, 86), ncol = 2)
rownames(counts) <- c("Hypoxia", "Normoxia")
colnames(counts) <- c("Allele A", "Allele B")
|          | Allele A | Allele B |
|----------|----------|----------|
| Hypoxia  | 10       | 32       |
| Normoxia | 94       | 86       |
Here the null hypothesis & alternative are:
\[ H_0: \text{There is no association between alleles and survival in hypoxic conditions} \\ \text{Vs} \\ H_A: \text{There is an association between alleles and survival in hypoxic conditions} \]
Let’s do a quick \(t\)-test in R and see what the results are. The null and alternative hypotheses for a \(t\)-test are: \[ H_0: \mu_1 = \mu_2 \quad \text{Vs} \quad H_A: \mu_1 \neq \mu_2 \]
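Assuming the two samples are stored as numeric vectors named sample1 and sample2 (these names are just placeholders), the test would be run as:

t.test(sample1, sample2)   # Welch two-sample t-test by default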
We can clearly see that these two samples are drawn from different distributions. The problem is that our data are not normally distributed (especially sample 1), so the \(t\)-test cannot cope with these extreme violations of normality. It is also pretty clear that for data like this, our estimates of the population variance will be inflated by the extreme values and will be much larger than the difference between the means.
An alternative would be the Wilcoxon Rank-Sum Test, where each value is given a rank and the ranks are compared.
\[ H_0: \text{The two samples are drawn from the same distribution} \\ \text{Vs} \\ H_A: \text{The two samples are drawn from different distributions} \]
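In R this is run with wilcox.test(); a sketch, again using the placeholder names sample1 and sample2:

wilcox.test(sample1, sample2)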
The ranks for each value can be seen using the following code:
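(Again using the placeholder names sample1 and sample2 for the two samples.)

rank(c(sample1, sample2))   # ranks of the pooled values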
Note that because we use the ranks, instead of the values, this procedure is robust to changes in the actual values which don’t affect the ranks.
If we change the value 1000 to 10, will it change the results of the Wilcoxon Rank Sum Test?

In R we can simply conduct Fisher’s Exact Test using fisher.test().
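For example, with the counts matrix defined above:

fisher.test(counts)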
Alternatively, we could’ve used a \(\chi^2\) test; however, we would’ve needed to ensure that no cell in the table had an expected count \(<5\).
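A sketch of that alternative; chisq.test() will warn if any expected counts are too small:

chisq.test(counts)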