Types of Statistical Inference

Single Hypothesis Testing

Recap

In the lectures we discussed:

Normally Distributed Data

  1. Every experiment is a random sample of the complete set of all possible samples

  2. We define a null hypothesis (\(H_0\)) so we can describe what our data will look like if there are no effects due to our experiment

  3. The sample mean (\(\bar{x}\)) is Normally Distributed around the population mean (\(\mu\))

  4. We can use our sample variance (\(s^2\)) to estimate population variance (\(\sigma^2\))

  5. A \(T\)-statistic can be calculated using the formula

    \(T = \frac{\bar{x} - \mu}{s / \sqrt{n}}\)

  6. We compare our \(T\)-statistic to a \(t\)-distribution to see how likely more extreme values are if \(H_0\) is true, and reject \(H_0\) if these \(p\)-values are low.

Exercises

Question 1

In the following, try to describe the value that the experiment is investigating:

  1. The SNP at https://gnomad.broadinstitute.org/variant/3-8787263-G-A?dataset=exac was found in 0.09% of 16366 South Asian genomes, and 0.05% of 8644 East Asian genomes.
    1. What do these numbers represent?
    2. Do you think that these values represent normally distributed data?
  2. In a qPCR experiment, the gene CTLA4 was found to be expressed in Treg cells at levels 10 times higher than those found in Th cells. What is this number an estimate of?
  3. In three transfection experiments, the average transfection efficiency was found to be 6%. What is this number estimating?

Question 2

The average age of all students at the University of Adelaide was calculated. Is this an estimate? Explain your reasoning

Question 3

  1. We think that a gene will be silenced by over-expression of a targeted siRNA. What would our null hypothesis and alternate hypothesis be?
  2. We suspect that a given SNP will provide increased zebrafish survival in a low-oxygen environment. We place 100 fish in a low-oxygen tank, and 100 fish in a normal-oxygen tank for 6 hours. At the end, we collect the survivors in each tank and genotype them at this SNP, giving the following results:
Allele A Allele B
Hypoxia 10 32
Normoxia 94 86

What would the null and alternate hypotheses be for this experiment?

Question 4

R allows us to easily take random samples from a normal distribution. We can generate 10 random samples from \(\mathcal{N}(5, 1)\) using the following code.

x <- rnorm(n = 10, mean = 5, sd = 1)

We could then find the mean of our random sample using the following

mean(x)
  1. Why was this value not equal to 5?
  2. Repeat the random sampling several times. Do you ever get 5 exactly?

We could actually repeat this 1000 times, very quickly in R. The following command will form a matrix with 1000 columns, and each column will be the 10 randomly sampled values.

rnd <- replicate(1000, rnorm(n = 10, mean = 5, sd = 1))
ncol(rnd)
nrow(rnd)

We could get the sample means from each column using the command colMeans() and plot a histogram of the means.

xbar <- colMeans(rnd)
hist(xbar, breaks = 50, freq = FALSE)

This is a histogram of the sample means from each of our 1000 random samples.

  1. Does it look like a normal distribution?

We know from lectures that each sample mean will itself be a random sample from \[ \bar{x} \sim \mathcal{N}(\mu, \frac{\sigma}{\sqrt{n}}) \]

In our example, we have set \(\mu = 5\), \(\sigma = 1\) and \(n = 10\). Let’s overlay this distribution to check.

x <- seq(3, 8, length.out = 100)
y <- dnorm(x, mean = 5, sd = 1 / sqrt(10))
hist(xbar, breaks = 50, freq = FALSE)
lines(x, y)

We can clearly see that each sample mean is truly a single random sample from this distribution.

Question 5

In the above example, we knew the variance (\(\sigma = 1\)) because we set this value as an argument to the rnorm function in our random sampling step. In reality we don’t know this value, so we would use our sample variance (\(s^2\)) and a \(t\)-test to calculate the \(t\)-statistic and compare this to a null distribution of the \(t\)-statistic. Let’s have a look at how the sample variances are distributed.

library(matrixStats)
s2 <- colVars(rnd)
hist(s2, breaks = 50)

Notice that the sample variances (\(s^2\)) are not normally distributed around the true variance (\(\sigma = 1\)).

For these random samples, we knew the true value for the population mean (\(\mu = 5\)). In reality, we wouldn’t know this and we’d need to conduct a hypothesis test with a null hypothesis and alternate hypothesis.

Let’s test the hypothesis using \(\mu = 5\) as the value of interest:

\[ H_0: \mu = 5 \quad \text{Vs} \quad H_A: \mu \neq 5 \]

To do this, we first calculate our \(T\)-statistic using: \[ T = \frac{\bar{x} - \mu}{s / \sqrt{n}} \] Let’s just do this on the first column to start with.

Tstat <- (mean(rnd[,1]) - 5) / (sd(rnd[,1]) / sqrt(10))
df <- 10 - 1

NB: Here the degrees of freedom is n - 1, so we would check this against the distribution for \(t_9\). In my simulations I got the value \(T =\) -0.1799. What value did you get from the first column in your simulations?

Let’s visualise this by first plotting the density of a \(t_9\) distribution, then adding vertical lines for our obtained statistic. By my estimation, 1 in 20 of you should get a relatively extreme \(T\)-statistic. (Feel free to just copy & paste this code chunk, but do try to understand what each line is doing.)

library(tidyverse)
x <- seq(-4, 4, length.out = 101) # Create a vector for the x-axis
y <- dt(x, df) # Find the value of the density at each value of x
data <- tibble(x = x, y = y) # Combine x & y into a tibble - i.e., a data frame
ggplot(data, aes(x = x, y = y)) +
  geom_line() +
  geom_ribbon(data = filter(data, x > abs(Tstat)),
              aes(x= x, ymax =y, ymin = 0),
              fill = "grey") +
  geom_ribbon(data = filter(data, x < -abs(Tstat)),
              aes(x= x, ymax =y, ymin = 0),
              fill = "grey") +
  geom_vline(xintercept = c(-1, 1)*Tstat) +
  theme_bw() +
  labs(x = "T", y = "") +
  scale_y_continuous(expand = c(0, 0)) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.line.y = element_blank(),
        axis.line.x = element_line(colour = "black"),
        panel.border = element_blank(),
        panel.grid = element_blank())
  1. Do you think we have a low or high probability of observing this \(t\)-statistic? Let’s check
2*pt(abs(Tstat), df, lower.tail = FALSE)

Multiple Hypothesis Testing

Recap

We also learned about:

  1. Type I and Type II errors
  2. The Family-Wise Error Rate (FWER)
  3. The Bonferroni Adjustment
  4. The False Discovery Rate

Exercises

Question 6

We could do this for every column in our matrix of random samples, which we know is a sample of 10 from \(\mathcal{N}(5, 1)\)

Tstat <- (colMeans(rnd) - 5) / (colSds(rnd) / sqrt(10))
pValues <- 2*pt(abs(Tstat), df, lower.tail = FALSE)

In many analyses, we use \(p < 0.05\) to reject \(H_0\)

  1. How many pValues out of our \(m = 1000\) tests do we see below 0.05?
sum(pValues < 0.05)
  1. Should we reject \(H_0\) for these random samples?
  2. If so, would this be a correct rejection of \(H_0\), a Type I error or a Type II error?

We could use the Bonferroni procedure to reduce our Type I errors, and control the Family Wise Error Rate (FWER)

alpha <- 0.05/1000
sum(pValues < alpha)

In R we can simply adjust our \(p\)-values using the function p.adjust(). This will give identical results to the above, but instead each adjusted-\(p\) value will be \(min(1, p*m)\), where \(m = 1000\) and represents the number of tests we are performing.

adjustedP <- p.adjust(pValues, method = "bonferroni")
sum(adjustedP < 0.05)

An alternative we briefly mentioned in lectures would be to allow a small number of false discoveries in our set of results. This is known as the FDR and we can easily obtain this in R.

fdrP <- p.adjust(pValues, method = "fdr")
sum(fdrP < 0.05) 
  1. Did you obtain different results using the two different methods?
  2. Explain why or why not?

Extended Example Using Simulation to Demonstrate Concepts

We often use simulated data to demonstrate statistical concepts like the central limit theorem and multiple testing procedures. As you have seen, we used simulated data from a t-distribution in the Statistics Tutorial to demonstrate what would happen if we did an experiment many, many times and calculated the means and variances of those repeat experiments. We do this because it can be difficult to generate test statistics from real data that contain combinations of results that come from both null and alternative hypotheses because we don’t know which hypotheses are truly null and which are alternative. So sometimes it can be helpful to combine real and simulated data to help demonstrate the concepts. We will use that combined approach in the following example.

Example genomic data

This example used data on normalised chromatin accessibility across 68 human donors. Chromatin accessibility is a quantitative measurement of the packing of DNA in a location that indicates protein accessibility and binding to DNA, in particular regulatory proteins which might affect transcription of genes. The original data are available from this source - http://mitra.stanford.edu/kundaje/portal/chromovar3d/dnase.html. Large values correspond to more accessible chromatin, each column corresponds to a different human donor, and each row corresponds to a distinct location in the genome. The rows are actually peaks of accessibility that were detected in at least some subset of the samples.

For the purposes of this example, and in part because the original data are many hundreds of megabytes in size, the code below simulates \(p\)-values to match the results of an analysis of the data where individual donors were grouped according to whether they had a particular disease or not. The analysis of differences in normalised chromatin accessibility between the two groups of donors was conducted using a \(t\)-test for each of the 250,000 genomic loci in the data. The details about the following code are not important, so you can skip over this section (for those interested, we are simulating \(p\)-values from 250,000 tests where the null hypothesis of no difference in chromatin accessibility is true, and then spiking these results with 10,000 tests where the alternative hypothesis is true (i.e., \(p\)-values < 0.05)).

# Generate p-values under the null hypothesis of no difference
set.seed(3574)
tsPvals <- runif(250000, 0, 1)
# Spike in 10,000 tests where the null hypothesis will be rejected 
# (i.e., inferring the alternative hypothesis)
tsPvals[1:10000] <- 10^runif(10000,-6.8,-1)
# Set Type 1 error rate (alpha)
alpha <- 0.05
# Histogram of the resulting p-values
brks <- 0:20/20
hist(tsPvals, col="grey", breaks=brks, main = "Histogram of P-values", xlab = "P-values")

Single Hypothesis testing

Recall that a Type 1 error is an incorrect decision that occurs when we conduct a statistical test and reject a null hypothesis that is actually true. The researcher can set the rate of such errors that are deemed acceptable for a particular experiment. Generally, 1/20 such errors are deemed acceptable, so the error rate is 5%, which makes the probability of a Type 1 error equal to 0.05; this value is usually termed (α).

Therefore, when we conduct a single hypothesis test the probability of making an error equals α (1/20 = 0.05) and the probability of not making an error equals 1-α (19/20 = 0.95).

Multiple hypothesis testing

If we want to do m = 250,000 hypothesis tests, as in our example data above, then if we use the standard α = 0.05 threshold as the Type 1 error rate for each test we would expect to reject the null hypothesis in 12,500 (250,000 * (1/20)) tests by chance alone (i.e., we would deem 12,500 loci significantly different between our treatment groups of donors by chance alone). We could say that the per-family error rate (PFER), which is the expected number of Type 1 Errors, is 12,500.

If we then define the Family-Wise Error Rate (FWER) as the probability of at least one Type 1 Error, we can calculate this probability across m tests as P(Making at least 1 error in m tests) = 1 - (1 - α)^m. If we have one hypothesis test, this equals 1 - (1 - 0.05)^1 = 0.05. This probability will continue to increase nonlinearly as we conduct more and more tests, and will asymptote close to 1 by the time we get to 100 tests!

Clearly, we have a multiple testing problem to address for our example data because we are performing 250,000 hypothesis tests - our FWER = 1!

Let’s now ask How many pValues out of our simulated \(m = 250000\) tests do we see below 0.05?

sum(tsPvals < alpha)

So this number is greater than the expected PFER of 12,500 by chance alone. How do we correct for multiple testing to see which results we should retain for further experimentation to make useful scientific discoveries?

Correcting the error rate for multiple tests

Many procedures have been developed to control the FWER (the probability of at least one type I error), and these fall into two types of FWER corrections: a single step type where equal adjustments are made to all \(p\)-values and a sequential type where adaptive adjustments are made to each \(p\)-value.

Bonferroni Adjustment

One single step approach is the Bonferroni adjustment to ensure the FWER equals α by adjusting α to equal α/m. In our example we would adjust α to equal 0.05/250000 = 0.0000002

alphaADJ <- 0.05/250000
sum(tsPvals < alphaADJ)

In R we can simply adjust our \(p\)-values using the function p.adjust(). This will give identical results to the above manual adjustment of α, but instead each adjusted-\(p\) value will be \(min(1, p*m)\), where \(m = 250000\) and represents the number of tests we are performing.

adjustedP <- p.adjust(tsPvals, method = "bonferroni")
sum(adjustedP < 0.05)

So the Bonferroni adjustment is a severe adjustment and would result in only 170 null hypotheses being rejected and these loci being explored further. Among many criticisms of the approach are that it is counter-intuitive because the interpretation of findings depends on the number of other tests performed, that the general null hypothesis (that all the null hypotheses are true) is rarely of interest, and that it leads to a high probability of Type 2 Errors (i.e., of failing to reject the null hypothesis when important effects exist).

False Discovery Rate

To address these concerns, one sequential approach is to control the False Discovery Rate (FDR), which is the expected proportion of Type I errors (false positives) among the rejected hypotheses. We want to control this FDR at α = 0.05. As described in the lecture, this approach is sequential because we first order the unadjusted \(p\)-values in ascending order (i.e., from smallest to largest). Then we work our way sequentially through the ordered list of \(p\)-values and evaluate whether each \(p\)-value is less than or equal to (j/m)*α where j is the rank order and reject the null hypotheses in these cases. We can stop when we reach the first \(p\)-value in the sequence that is greater than (j/m)*α because all subsequent \(p\)-values will also fail to reject the null hypothesis.

The following code used the False Discovery Rate (FDR) adjustment for our example results:

adjustedP_BH <- p.adjust(tsPvals, method = "BH")
sum(adjustedP_BH < 0.05)

Here we see that the FDR results in 7169 rejections that we can pursue with further research. This result is therefore intermediate between the result we would come to without accounting for multiple tests rejecting 21551 tests and the extremely conservative Bonferroni approach where we would reject only 170 tests.

The FDR balances between making too many Type 1 Errors and wasting lost of money chasing dead ends with making too many Type 2 Errors and missing lots of important discoveries.

Non-parametric Testing

Recap

The Wilcoxon Rank-Sum Test

All of the above examples refer to normally distributed data. Let’s say we have some pretty extreme values.

sample1 <- c(5.5, 2.45, 1000, 6)
sample2 <- c(-1.1, 1.2, 2.4, -40)

Fisher’s Exact Test

Fisher’s Exact Test can be used for comparing counts in a \(2 \times 2\) contingency table. In the hypoxia example above these values would be:

counts <- matrix(c(10, 94, 32, 86), ncol = 2)
rownames(counts) <- c("Hypoxia", "Normoxia")
colnames(counts) <- c("Allele A", "Allele B")
  Allele A Allele B
Hypoxia 10 32
Normoxia 94 86

Here the null hypothesis & alternative are:

\[ H_0: \text{There is no association between alleles and survival in hypoxic conditions} \\ Vs \\ H_A: \text{There is an association between alleles and survival in hypoxic conditions} \]

Exercises

Question 7

  1. Do you think these values come from the same distribution?

Let’s do a quick \(t\)-test in R and see what the results are. The null and alternative hypotheses for a \(t\)-test are: \[ H_0: \mu_1 = \mu_2 \quad \text{Vs} \quad H_A: \mu_1 \neq \mu_2 \]

t.test(sample1, sample2)

We can clearly see that these two samples are drawn from different distributions. The problem is that our data is not normally distributed (especially sample 1) so the \(t\)-test cannot cope with these extreme violations of normality. It is also pretty clear that for data like this, our estimates of the population variance will be much larger than the difference.

An alternative would be the Wilcoxon Rank-Sum Test, where each value is given a rank and the ranks are compared.

\[ H_0: \text{The two samples are drawn from the same distribution} \\ VS \\ H_A: \text{The two samples are drawn from different distributions} \]

wTest <- wilcox.test(sample1, sample2)
wTest

The ranks for each value can be seen using the following code:

 order(c(sample1, sample2), decreasing = TRUE)

Note that because we use the ranks, instead of the values, this procedure is robust to changes in the actual values which don’t affect the ranks.

  1. If we change the value 1000 to 10 will it change the results of the Wilcoxon Rank Sum Test?
  2. Would it change the results for a \(t\)-test?
sample1[3] <- 10
sample1
t.test(sample1, sample2)
wilcox.test(sample1, sample2)

Question 8

  1. Do you think there is an association just by looking at the data?

In R we can simply conduct the test using fisher.test()

fisher.test(counts)

Alternatively, we could’ve used a \(\chi^2\) test, however we would’ve need to ensure that no cell in the table had an expected count \(<5\).

chisq.test(counts)
  1. Would you reject \(H_0\) using either test?