In the lectures we discussed a few things:
For data that is not normally distributed we learned two new tests besides a \(t\)-test:
In the following, try to describe the value that the experiment is investigating:
The average age of all students at the University of Adelaide was calculated. Is this an estimate? Explain your reasoning
Allele A | Allele B | |
---|---|---|
Hypoxia | 10 | 32 |
Normoxia | 94 | 86 |
What would the null and alternate hypotheses be for this experiment?
R
allows us to easily take random samples from a normal
distribution. We can generate 10 random samples from \(\mathcal{N}(5, 1)\) using the following
code.
x <- rnorm(n = 10, mean = 5, sd = 1)
We could then find the mean of our random sample using the following
mean(x)
We could actually repeat this 1000 times, very quickly in
R
. The following command will form a matrix with 1000
columns, and each column will be the 10 randomly sampled values.
rnd <- replicate(1000, rnorm(n = 10, mean = 5, sd = 1))
ncol(rnd)
nrow(rnd)
We could get the sample means from each column using the command
colMeans()
and plot a histogram of the means.
xbar <- colMeans(rnd)
hist(xbar, breaks = 50, freq = FALSE)
This is a histogram of the sample means from each of our 1000 random samples.
We know from lectures that each sample mean will itself be a random sample from \[ \bar{x} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]
In our example, we have set \(\mu = 5\), \(\sigma = 1\) and \(n = 10\). Let’s overlay this distribution to check.
x <- seq(3, 8, length.out = 100)
y <- dnorm(x, mean = 5, sd = 1 / sqrt(10))
hist(xbar, breaks = 50, freq = FALSE)
lines(x, y)
We can clearly see that each sample mean is truly a single random sample from this distribution.
In the above example, we knew the variance (\(\sigma = 1\)) because we set this value as
an argument to the rnorm
function in our random sampling
step. In reality we don’t know this value, so we would use our sample
variance (\(s^2\)) and a \(t\)-test to calculate the \(t\)-statistic and compare this to a null
distribution of the \(t\)-statistic.
Let’s have a look at how the sample variances are distributed.
library(matrixStats)
s2 <- colVars(rnd)
hist(s2, breaks = 50)
Notice that the sample variances (\(s^2\)) are not normally distributed around the true variance (\(\sigma = 1\)).
For these random samples, we knew the true value for the population mean (\(\mu = 5\)). In reality, we wouldn’t know this and we’d need to conduct a hypothesis test with a null hypothesis and alternate hypothesis.
Let’s test the hypothesis using \(\mu = 5\) as the value of interest:
\[ H_0: \mu = 5 \quad \text{Vs} \quad H_A: \mu \neq 5 \]
To do this, we first calculate our \(T\)-statistic using: \[ T = \frac{\bar{x} - \mu}{s / \sqrt{n}} \] Let’s just do this on the first column to start with.
Tstat <- (mean(rnd[,1]) - 5) / (sd(rnd[,1]) / sqrt(10))
df <- 10 - 1
NB: Here the degrees of freedom is n - 1, so we would check this against the distribution for \(t_9\). In my simulations I got the value \(T =\) -0.8771. What value did you get from the first column in your simulations?
Let’s visualise this by first plotting the density of a \(t_9\) distribution, then adding vertical lines for our obtained statistic. By my estimation, 1 in 20 of you should get a relatively extreme \(T\)-statistic. (Feel free to just copy & paste this code chunk, but do try to understand what each line is doing.)
library(tidyverse)
x <- seq(-4, 4, length.out = 101) # Create a vector for the x-axis
y <- dt(x, df) # Find the value of the density at each value of x
data <- tibble(x = x, y = y) # Combine x & y into a tibble - i.e., a data frame
ggplot(data, aes(x = x, y = y)) +
geom_line() +
geom_ribbon(data = filter(data, x > abs(Tstat)),
aes(x= x, ymax =y, ymin = 0),
fill = "grey") +
geom_ribbon(data = filter(data, x < -abs(Tstat)),
aes(x= x, ymax =y, ymin = 0),
fill = "grey") +
geom_vline(xintercept = c(-1, 1)*Tstat) +
theme_bw() +
labs(x = "T", y = "") +
scale_y_continuous(expand = c(0, 0)) +
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.line.y = element_blank(),
axis.line.x = element_line(colour = "black"),
panel.border = element_blank(),
panel.grid = element_blank())
2*pt(abs(Tstat), df, lower.tail = FALSE)
We also learned about:
We could do this for every column in our matrix of random samples, which we know is a sample of 10 from \(\mathcal{N}(5, 1)\)
Tstat <- (colMeans(rnd) - 5) / (colSds(rnd) / sqrt(10))
pValues <- 2*pt(abs(Tstat), df, lower.tail = FALSE)
In many analyses, we use \(p < 0.05\) to reject \(H_0\)
sum(pValues < 0.05)
We could use the Bonferroni procedure to reduce our Type I errors, and control the Family Wise Error Rate (FWER)
alpha <- 0.05/1000
sum(pValues < alpha)
In R
we can simply adjust our \(p\)-values using the function
p.adjust()
. This will give identical results to the above,
but instead each adjusted-\(p\) value
will be \(min(1, p*m)\), where \(m = 1000\) and represents the number of
tests we are performing.
adjustedP <- p.adjust(pValues, method = "bonferroni")
sum(adjustedP < 0.05)
An alternative we briefly mentioned in lectures would be to allow a
small number of false discoveries in our set of results. This is known
as the FDR and we can easily obtain this in R
.
fdrP <- p.adjust(pValues, method = "fdr")
sum(fdrP < 0.05)
All of the above examples refer to normally distributed data. Let’s say we have some pretty extreme values.
sample1 <- c(5.5, 2.45, 1000, 6)
sample2 <- c(-1.1, 1.2, 2.4, -40)
Let’s do a quick \(t\)-test in
R
and see what the results are. The null and alternative
hypotheses for a \(t\)-test are: \[
H_0: \mu_1 = \mu_2 \quad \text{Vs} \quad H_A: \mu_1 \neq \mu_2
\]
t.test(sample1, sample2)
We can clearly see that these two samples are drawn from different distributions. The problem is that our data is not normally distributed (especially sample 1) so the \(t\)-test cannot cope with these extreme violations of normality. It is also pretty clear that for data like this, our estimates of the population variance will be much larger than the difference.
An alternative would be the Wilcoxon Rank-Sum Test, where each value is given a rank and the ranks are compared.
\[ H_0: \text{The two samples are drawn from the same distribution} \\ VS \\ H_A: \text{The two samples are drawn from different distributions} \]
wTest <- wilcox.test(sample1, sample2)
wTest
The ranks for each value can be seen using the following code:
order(c(sample1, sample2), decreasing = TRUE)
Note that because we use the ranks, instead of the values, this procedure is robust to changes in the actual values which don’t affect the ranks.
1000
to 10
will
it change the results of the Wilcoxon Rank Sum Test?sample1[3] <- 10
sample1
t.test(sample1, sample2)
wilcox.test(sample1, sample2)
Fisher’s Exact Test can be used for comparing counts in a \(2 \times 2\) contingency table. In the hypoxia example above these values would be:
counts <- matrix(c(10, 94, 32, 86), ncol = 2)
rownames(counts) <- c("Hypoxia", "Normoxia")
colnames(counts) <- c("Allele A", "Allele B")
Allele A | Allele B | |
---|---|---|
Hypoxia | 10 | 32 |
Normoxia | 94 | 86 |
Here the null hypothesis & alternative are:
\[ H_0: \text{There is no association between alleles and survival in hypoxic conditions} \\ Vs \\ H_A: \text{There is an association between alleles and survival in hypoxic conditions} \]
In R
we can simply conduct the test using
fisher.test()
fisher.test(counts)
Alternatively, we could’ve used a \(\chi^2\) test, however we would’ve need to ensure that no cell in the table had an expected count \(<5\).
chisq.test(counts)