
Proving or disproving a research hypothesis requires representative evidence to substantiate our claims. In general, we need a large amount of data collected through a rigorous and well-documented procedure. However, inviting all potential subjects is not feasible due to time and resource constraints. Given these limitations, how do we determine the minimum amount of data required to answer our research question?

P-value

We reject our \(H_0\) when our statistical test results in a p-value < 0.05. But why do we use 0.05 as our cut-off point? Strictly speaking, the p-value is the probability of observing data at least as extreme as ours if the null hypothesis were true, so 0.05 reflects a 5% chance of seeing such data under a correct \(H_0\). I am inclined to regard it simply as a probability value: when that probability is small enough, we reject our \(H_0\). To help us visualize the idea, we can use a formal approach to answer our hypothesis on (ahem) a simple coin toss.

# The %>% and %T>% (tee) pipes used throughout come from magrittr
library(magrittr)
set.seed(1)
coin <- sample(c("H", "T"), 10, replace=TRUE, prob=rep(1/2, 2)) %T>% print()
|  [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"

As always, we set H as our outcome of interest. After observing ten coin tosses, we may wish to formulate our hypotheses about \(p\), the probability of obtaining heads:

  • \(H_0: p = 0.5\)
  • \(H_a: p \neq 0.5\)

Assuming independent tosses of a fair coin, each toss is a Bernoulli trial and the number of heads follows a binomial distribution. But, does it? :)

binom.test(x=sum(coin == "H"), n=length(coin), p=0.5)
| 
|   Exact binomial test
| 
| data:  sum(coin == "H") and length(coin)
| number of successes = 6, number of trials = 10, p-value = 0.8
| alternative hypothesis: true probability of success is not equal to 0.5
| 95 percent confidence interval:
|  0.2624 0.8784
| sample estimates:
| probability of success 
|                    0.6

At this point, we have seen the binomial test numerous times, yet we have not unraveled the secret behind this math! Why did we fail to reject the \(H_0\)? Or rather, why is the p-value > 0.05? The question we wish to answer is:

What is the probability of having 6 H out of 10 Bernoulli trials? Is it < 5%?

To answer our question, we may use the binomial probability mass function. To find \(P(X=6): X \sim B(10, 0.5)\), we can use the following command in R:

dbinom(6, 10, 0.5)
| [1] 0.2051

We can manually calculate the two-sided p-value as twice the sum of \(P(X \geqslant 6)\):

2 * (dbinom(6:10, 10, 0.5) %>% sum())
| [1] 0.7539

So we could not distinguish a relative probability of 0.6 from 0.5 based on ten consecutive coin tosses. Interesting. What if we preserve the ratio of events (6 heads out of 10, or 3:5) while using more trials?

As previously done, we can find the probability \(P(X=60): X \sim B(100, 0.5)\) using the Binomial probability function:

dbinom(60, 100, 0.5)
| [1] 0.01084

And the p-value would be:

2 * (dbinom(60:100, 100, 0.5) %>% sum())
| [1] 0.05689

We preserved the ratio, so why has the probability changed? The reason lies in the number of trials we conducted. We have observed how the peak and the width of the distribution change: the more trials we have, the narrower the observed distribution gets. That, in itself, does not change our 5% significance level, but it changes where the critical value is located relative to the midpoint. Theoretically, the p-value can be difficult to grasp, but in practice it tells you how probable your observed data (or more extreme data) would be if \(H_0\) were correct.
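
To make the effect of sample size concrete, here is a small sketch (not part of the original output) that keeps the proportion of heads at 60% while increasing the number of tosses; the exact two-sided p-value shrinks as the distribution narrows.

# Keep 60% heads while increasing the number of tosses
sapply(c(10, 100, 1000), function(n) binom.test(x = round(0.6 * n), n = n, p = 0.5)$p.value)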

Low p-value \(\to\) reject \(H_0\)

Cassie Kozyrkov has uploaded a splendid example of how to comprehend the p-value.

Significance Level

In interpreting the p-value, we use a threshold of 0.05, which reflects our significance level \(\alpha\). With a higher \(\alpha\), we have a greater chance of rejecting the \(H_0\), and that chance also grows as we collect more samples. However, we need to worry about incorrect rejection, viz. what is the chance that we incorrectly reject a true \(H_0\)?
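
One way to see \(\alpha\) in action is a quick simulation, a sketch of my own (the number of experiments and the sample size are illustrative): when \(H_0\) is actually true, \(\alpha\) is roughly the long-run share of experiments we would wrongly declare significant.

# Simulate 10,000 experiments with a genuinely fair coin (H0 is true)
set.seed(1)
p.values <- replicate(10000, binom.test(rbinom(1, 100, 0.5), 100, 0.5)$p.value)
# Share of false rejections; a bit below 0.05 because the binomial is discrete
mean(p.values < 0.05)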

Suppose we are conducting a study on a potential cancer therapy. We know that patients given a placebo recover at a rate of 50%. We believe that giving the new treatment will increase that probability. Tested on 50 patients, 35 showed signs of a better quality of life.

Assuming the observations are i.i.d., we may model our problem with a binomial distribution, \(Cured \sim B(50, 0.5)\), so we can state our hypotheses as:

\begin{align} H_0 &: p = 0.5 \\ H_a &: p > 0.5 \end{align}

Since the data follows the Binomial distribution, we can assess our hypothesis using the Binomial test:

binom.test(35, 50, 0.5, alternative="greater")
| 
|   Exact binomial test
| 
| data:  35 and 50
| number of successes = 35, number of trials = 50, p-value = 0.003
| alternative hypothesis: true probability of success is greater than 0.5
| 95 percent confidence interval:
|  0.5763 1.0000
| sample estimates:
| probability of success 
|                    0.7

With a nice visualization, we can see where our hypothesis is located relative to the midpoint (hint: the blue dot).

We are assuming \(H_a > H_0\), so how do we picture \(\alpha\) in our figure? Since \(\alpha\) is the significance level of 0.05, we need to find the point \(x\) such that \(P(X \leqslant x\ |\ n=50, p=0.5) \geqslant 1 - \alpha\), i.e. the \(1 - \alpha\) quantile of the \(H_0\) distribution. After doing so, we can mark our significance value on the x-axis (the red dashed line).
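
In code, this significance value is just the \(1 - \alpha\) quantile of the \(H_0\) distribution, a one-line sketch anticipating the power calculation later in this post:

# 1 - alpha quantile of B(50, 0.5); this marks the red dashed line
qbinom(1 - 0.05, size = 50, prob = 0.5)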

The area to the right of our significance value determines the probability of making a type I error, \(\alpha\). If we were to draw a second distribution reflecting our \(H_a\), we would observe an overlapping area reflecting our probability of making a type II error, \(\beta\). Assuming \(H_a\) comes from the same family of distributions as \(H_0\), we just need to determine its parameter. However, the value of \(\beta\) depends on the \(H_a\) distribution, and we only stated \(H_a > H_0\). This means the parameter under \(H_a\) can be any value higher than 0.5, be it 0.6, 0.7, 0.8, and so on. For our convenience, we will assign 0.7 as our \(H_a\) so we can construct the second binomial distribution.

The area to the left of our significance value under the second, \(H_a\), distribution is the probability of making a type II error, \(\beta\). We can summarize the characteristics of type I and type II statistical errors as follows (a short numeric sketch comes right after the summary):

Type I statistical error:

  • Incorrectly rejecting the \(H_0\)
  • Reflected as \(\alpha \to\) shaded area to the right of \(H_0\) distribution
  • A false positive

Type II statistical error:

  • Incorrectly accepting the \(H_0\)
  • Reflected as \(\beta \to\) shaded area to the left of \(H_a\) distribution
  • A false negative
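
As a small numeric sketch (my own, using the same convention as the power calculation later in this post), we can put numbers on both shaded areas for the cancer-therapy example:

crit <- qbinom(0.95, 50, 0.5)            # critical value under H0: B(50, 0.5)
alpha.area <- 1 - pbinom(crit, 50, 0.5)  # type I error: H0 area to the right of crit
beta.area <- pbinom(crit, 50, 0.7)       # type II error: Ha area to the left of crit, with Ha = 0.7
c(critical = crit, alpha = alpha.area, beta = beta.area)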

Power Analysis

So far, we have learnt that there is always some probability of incorrectly rejecting the \(H_0\), which depends heavily on the significance level \(\alpha\), the rate of type II error \(\beta\), and the sample size. We can lower \(\alpha\) to reduce the risk of a type I error at the cost of increasing the risk of a type II error. Balancing the two is part of the art of statistics, where we may choose to tolerate one type of error over the other. Imagine a court case as a statistical problem: someone has been charged with robbery, and we need to prove their guilt before sending them to jail. After collecting the evidence, we face two possible errors:

  • We incorrectly put an innocent person in jail (type I error)
  • We failed to identify the culprit and let them free (type II error)

Which is the more acceptable mistake: letting a criminal go free or jailing an innocent citizen? That is the kind of question we need to deal with in statistics, namely whether we are more willing to risk a type I or a type II error. However, if we have more data (analogous to more evidence in a court setting), we can reduce both errors at the expense of time and effort. Good research demands a sufficient amount of data to guard against both types of error. Please be advised, though, that we are not going anywhere near type III and other special errors in statistics. For now, understanding the relation between statistical errors and sample size is enough to introduce a new topic: statistical power.

In statistics, we define power with a simple equation: \(1 - \beta\). However, determining the value of \(\beta\) is a bit tricky, as we have seen visually. In essence, good statistical power helps us correctly reject the \(H_0\) when it is actually false. As a rule of thumb, we expect a statistical power in the range of 0.8 - 0.9 (higher is better). We can analyze power prospectively or retrospectively, depending on the context of what we would like to prove. When done prospectively, a thorough power analysis can also help us determine the minimum required sample size.

However, conducting a power analysis is not all roses, since it comes with several caveats. First of all, a power analysis depends on the formal method we plan to use in our statistical analysis pipeline. Since it is tied to our statistical methods and study design, it often does not generalize well: if we were to propose a different statistical method, we might need to redo our power analysis. And lastly, a power analysis only gives us a best-case estimate.

There are four linked concepts we need in order to comprehend statistical power:

  • Power
  • Effect size
  • Sample size
  • Alpha

So far, we have discussed the other three, but this is the first time we hear about effect size. As a disclaimer, this post only aims to give a brief overview of what effect size is; we will not delve further into formally proving the equations we will later use. In layman's terms, the effect size measures the true difference between two hypotheses. Numerous conventions exist to calculate it. A larger effect size translates into more statistical power, and, as with all good things, the effect size is one of the most difficult quantities to obtain.

To obtain an effect size, we may perform a thorough literature review, conduct a pilot study, or follow Cohen's recommendations. When reviewing published articles, we may find some with methods similar to ours; in such cases, we can use their data to estimate the desired effect size, and meta-analytic techniques are sometimes applicable to make a better estimate. When we fail to find published articles with similar methods and sufficiently detailed results, we may consider conducting a pilot study. By doing so, we obtain data reflecting our future study; it is time-consuming, but it gives us a closer estimate of how our data will look, and a pilot study is also a good chance to resolve any unforeseen issues. Finally, following Cohen's recommendations may help us decide what effect size we can regard as adequate in a prospective power analysis. Still, it depends on which formal test we use, since each test has thresholds separating small, medium, and large effect sizes.
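
For example, assuming the pwr package is installed, Cohen's conventional thresholds can be looked up per test; the sketch below asks for the conventional medium effect size of a proportion test.

library(pwr)

# Cohen's conventional "medium" effect size for a proportion test
cohen.ES(test = "p", size = "medium")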

All words and no numbers will not help us much in understanding the context, so we shall dig into an example! We will reuse our previous case of a novel cancer drug trial. Can we recall this figure?

We can calculate power when we know the probability function and its parameters, as such:

$$Let\ X \sim B(n, p)$$

\begin{align} sig &= \min\,\{x : P(X \leqslant x\ |\ n, H_0) \geqslant 1 - \alpha\} \\ \beta &= P(X \leqslant sig\ |\ n, H_1) \\ Power &= 1 - \beta \end{align}

Plain math alone may not make the idea stick, so how about some code?

# Set H0, sample size, significance level (alpha)
h0 <- 0.5; size <- 50; alpha.rate <- 0.05

# Find significance value
alpha.value <- qbinom(1 - alpha.rate, size, prob=h0) %T>% print()
| [1] 31
# Determine H1
h1 <- 0.7

# Calculate beta
beta.value <- dbinom(0:alpha.value, size, prob=h1) %>% sum() %T>% print()
| [1] 0.1406
# Calculate power
1 - beta.value
| [1] 0.8594

Now, that is better :) If you find it hard to follow what's going on, you may want to focus on the commented sections (any line starting with #), as they indicate what each step does. Thankfully, we have some ready-to-use packages to do the computation for us (yay to the devs!), so we do not need to reinvent the wheel every time.
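
As one such option (an assumption on my part, not code from the original post), the pwr package covers the classic cases; its one-sample proportion test relies on a normal approximation with an arcsine-transformed effect size, so the result will differ slightly from our exact binomial calculation above.

library(pwr)

# Arcsine-transformed effect size for p1 = 0.7 vs p0 = 0.5
h <- ES.h(0.7, 0.5)

# Approximate power of a one-sided, one-sample proportion test with n = 50
pwr.p.test(h = h, n = 50, sig.level = 0.05, alternative = "greater")

# Or solve for the minimum n needed to reach 80% power
pwr.p.test(h = h, power = 0.8, sig.level = 0.05, alternative = "greater")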

Equations for Calculating Sample Size

As with effect sizes, we have numerous equations for finding a minimum sample size. None fits all needs, as the choice depends on our research context. We will look at popular ones used in general and in biomedical science.

General Equation

$$n = \bigg( \frac{Z_{1 - \frac{\alpha}{2}} + Z_{1-\beta}}{ES} \bigg)^2$$

In general, the culprit causing most of the headaches when calculating a sample size is the effect size; the equation itself only consists of plain arithmetic. We shall understand each symbol first before looking at the different effect size measures (a small helper function follows the symbol list).

\(n\): Minimum required sample size
\(Z_{1 - \frac{\alpha}{2}}\): Significance value in a standardized normal distribution
\(Z_{1-\beta}\): Power value in a standardized normal distribution
\(ES\): Effect size
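
The general equation translates directly into a few lines of R; n_required below is a hypothetical helper of my own naming, not something defined in this post.

# Minimum sample size from the general equation, given an effect size,
# a two-sided significance level, and a target power
n_required <- function(es, alpha = 0.05, power = 0.8) {
  z.alpha <- qnorm(1 - alpha / 2)  # Z_{1 - alpha/2}
  z.beta <- qnorm(power)           # Z_{1 - beta}
  ceiling(((z.alpha + z.beta) / es)^2)
}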

Dichotomous outcome, one sample

\begin{align} H_0 &: p = p_0 \\ ES &= \frac{p_1 - p_0}{\sqrt{p_0(1-p_0)}} \end{align}

Dichotomous outcome, two independent samples

\begin{align} H_0 &: p_1 = p_2 \\ ES &= \frac{|p_1 - p_2|}{\sqrt{p(1-p)}} \end{align}

Continuous outcome, one sample

\begin{align} H_0 &: \mu = \mu_0 \\ ES &= \frac{|\mu_1 - \mu_0|}{\sigma} \end{align}

Continuous outcome, two independent samples

\begin{align} H_0 &: \mu_1 = \mu_2 \\ ES &= \frac{|\mu_1 - \mu_2|}{\sigma} \end{align}

Continuous outcome, two matched samples

\begin{align} H_0 &: \mu_d = 0 \\ ES &= \frac{\mu_d}{\sigma_d} \end{align}
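
As a usage example with purely illustrative numbers, consider a one-sample continuous outcome where we expect a mean of 105 against a null value of 100 with a standard deviation of 12; plugging the resulting effect size into the n_required helper sketched above gives the minimum sample size.

# Continuous outcome, one sample (illustrative numbers only)
es <- abs(105 - 100) / 12  # |mu_1 - mu_0| / sigma
n_required(es, alpha = 0.05, power = 0.8)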

Different study designs may require different solutions, and different fields of science have contributed countless preferences in how statistics is done. So what do we do as biomedical scientists? (Charan and Biswas, 2013)

Cross-Sectional

Qualitative variable

$$n = \frac{Z_{1-\frac{\alpha}{2}}^2 \cdot p (1-p)}{d^2}$$

Quantitative variable

$$n = \frac{Z_{1-\frac{\alpha}{2}}^2 \cdot \sigma^2}{d^2}$$

\(Z_{1 - \frac{\alpha}{2}}\): Significance value in a standardized normal distribution
\(d\): Absolute error as determined by the researcher
\(p\): Estimated proportion
\(\sigma\): Standard deviation
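
With illustrative numbers only, both cross-sectional formulas are one-liners: say an expected prevalence of 30% with an absolute error of 5% for the qualitative case, and a standard deviation of 15 with an absolute error of 3 for the quantitative case.

z <- qnorm(1 - 0.05 / 2)  # Z_{1 - alpha/2} at alpha = 0.05

n.qualitative <- z^2 * 0.30 * (1 - 0.30) / 0.05^2  # p = 0.30, d = 0.05
n.quantitative <- z^2 * 15^2 / 3^2                 # sigma = 15, d = 3

ceiling(c(qualitative = n.qualitative, quantitative = n.quantitative))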

Case-Control

Qualitative variable

$$n = \frac{r+1}{r} \frac{(p^*)(1-p^*)(Z_{\beta} + Z_{\frac{\alpha}{2}})^2}{(p_1 - p_2)^2}$$

Quantitative variable

$$n = \frac{r+1}{r} \frac{\sigma^2(Z_{\beta} + Z_{\frac{\alpha}{2}})^2}{d^2}$$

\(r\): Ratio of controls to cases
\(p^*\): Average proportion exposed, i.e. the mean of the exposure proportions in cases and controls
\(\sigma\): Standard deviation from a previous publication
\(p_1 - p_2\): Difference in proportions as previously reported
\(d\): Expected difference in means between the two groups
\(Z_{\beta}\): \(\beta\) value in a standardized normal distribution
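
Again with illustrative numbers only: one control per case (r = 1), an expected exposure of 40% among cases versus 20% among controls, a two-sided \(\alpha\) of 0.05, and 80% power.

r <- 1; p1 <- 0.40; p2 <- 0.20
p.star <- (p1 + p2) / 2        # average exposed proportion
z.alpha <- qnorm(1 - 0.05 / 2)
z.beta <- qnorm(0.80)

n.cases <- (r + 1) / r * p.star * (1 - p.star) * (z.beta + z.alpha)^2 / (p1 - p2)^2
ceiling(n.cases)               # cases required; controls = r x cases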

Clinical Trial / Experimental

Qualitative variable

$$n = \frac{2 P(1-P) \cdot (Z_{\frac{\alpha}{2}} + Z_{\beta})^2}{(p_1-p_2)^2}$$

Quantitative variable

$$n = \frac{2\sigma^2 \cdot (Z_{\frac{\alpha}{2}} + Z_{\beta})^2}{d^2}$$

\(\sigma\): Standard deviation from a previous publication
\(P\): Pooled prevalence of both groups, i.e. the average of \(p_1\) and \(p_2\)
\(p_1 - p_2\): Difference in proportions as previously reported
\(d\): Expected difference in means between the two groups
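
For the qualitative clinical-trial formula, here is a sketch reusing our cancer-therapy numbers (a 50% response on placebo versus an expected 70% on treatment), with a two-sided \(\alpha\) of 0.05 and 80% power; these values are illustrative, not taken from a real trial.

p1 <- 0.5; p2 <- 0.7
P <- (p1 + p2) / 2             # pooled prevalence
z.alpha <- qnorm(1 - 0.05 / 2)
z.beta <- qnorm(0.80)

n.per.arm <- 2 * P * (1 - P) * (z.alpha + z.beta)^2 / (p1 - p2)^2
ceiling(n.per.arm)             # subjects needed per treatment arm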

Random Sampling

After understanding what the p-value is and how to calculate the minimum sample size, we are one step closer to conducting our own investigation. We have previously discussed the importance of having independent and identically distributed data, and random sampling is one way to fulfil that assumption. In sampling, we can choose either a non-probability or a probability method to acquire our subjects. In the following sections, we will briefly explain each of the available methods, with a short code sketch at the end.

Non-Probability Random Sampling

Convenience

  • Based on availability
  • Representativeness is unknown
  • Useful in preliminary study

Quota

  • As in convenience sampling
  • We set the desired proportion of our sample
  • Proportion based on specific criteria, e.g. age, sex, etc.

Probability Random Sampling

Simple

  • Random sample from a list of all subjects in a population
  • Each subject has an equal chance to participate
  • Useful in a small population

Systematic

  • Subject selection is not entirely random
  • As in simple random sampling, requires an enumeration of all subjects
  • Systematically select subjects based on a certain criterion, e.g. every \(n^{th}\) subject

Stratified / cluster

  • Split subjects into stratified / clustered groups
  • Do random sampling from each group
  • Stratified \(\to\) strata preserve the population's composition, i.e. each subgroup keeps its proportion
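
As a base-R sketch (illustrative only, on a made-up population of 1,000 subjects), the three probability schemes above could look like this:

set.seed(1)
population <- data.frame(id = 1:1000, sex = sample(c("F", "M"), 1000, replace = TRUE))

# Simple random sampling: every subject has an equal chance
simple <- population[sample(nrow(population), 50), ]

# Systematic sampling: every 20th subject after a random start
start <- sample(20, 1)
systematic <- population[seq(start, nrow(population), by = 20), ]

# Stratified sampling: a random sample within each stratum (here, sex)
stratified <- do.call(rbind, lapply(split(population, population$sex),
                                    function(s) s[sample(nrow(s), 25), ]))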

Charan, J. and Biswas, T. (2013) How to calculate sample size for different study designs in medical research? Indian Journal of Psychological Medicine [online]. 35 (2), pp. 121. Available from: https://doi.org/10.4103/0253-7176.116232.