In statistics, supporting or rejecting an assumption requires a rigorous formal approach that depends heavily on the formulated hypothesis. This post walks through statistical tests for a hypothesized difference in proportions among categorical variables, a procedure also commonly known as a test of independence. We shall see which tests are available, how they work and when to use them.

Proportional Difference

We may recall from the first lecture that we can measure a proportion in a population or sample as the ratio of the number of events to the total count. As an example, imagine we have 100 participants in a survey, where 60 respondents are male and the rest are female. We can calculate the proportion of male participants as \(\frac{60}{100} = 0.6\). In previous lectures, we also performed a number of experiments involving coin flipping and calculating the probability \(P(X=x|n, p)\), by considering \(X \sim B(n, p)\). What we previously did was testing \(H_0: p = 0.5\) against the proportion \(\hat{p}\) seen in our sample, with \(p\) being the assumed probability of heads in a fair coin. We have yet to coin the term (pun intended), but we have grown familiar with a form of hypothesis testing called the binomial test.

The binomial test is an exact hypothesis test for a one-group proportion. With an exact test, the p-value is computed directly from the exact sampling distribution under \(H_0\) rather than from a large-sample approximation. In later sections, we shall observe how calculating an exact probability gets more tedious with a larger sample size and more complex comparisons. As a solution, we often rely on an approximation, e.g. using the proportion test for a one-group proportion problem as an alternative to the binomial test. As we have gotten quite comfortable with the idea of flipping a coin, we may use this example as a proof of concept in distinguishing an exact test from its approximation.

$$\text{Let } X \sim B(n, p)$$ $$\texttt{Test the probability of having: } P(X=6 \ |\ 10, 0.5)$$

\begin{align} H_0 &: p = 0.5 \\ H_a &: p \neq 0.5 \end{align}

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.6 | 6 | 0.754 | 10 | 0.262 | 0.878 | Exact binomial test | two.sided |

(#tab:prop.1)Binomial test

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.6 | 0.1 | 0.752 | 1 | 0.274 | 0.863 | 1-sample proportions test with continuity correction | two.sided |

(#tab:prop.1)Proportion test
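The two tables above can be reproduced along the following lines. This is a minimal sketch rather than the original code: `binom.test()` and `prop.test()` are base R functions, while the variable names are our own. The manual sum illustrates how a two-sided exact p-value adds up every outcome that is no more likely than the observed one.

```r
# Observed: 6 successes in 10 flips, tested against a fair coin (p = 0.5)
x <- 6; n <- 10; p0 <- 0.5

exact_test  <- binom.test(x, n, p = p0)  # exact binomial test
approx_test <- prop.test(x, n, p = p0)   # approximation, continuity-corrected by default

# Two-sided exact p-value by hand: sum the probabilities of all outcomes
# that are at most as likely as the observed one (small tolerance for ties)
probs    <- dbinom(0:n, size = n, prob = p0)
p_manual <- sum(probs[probs <= dbinom(x, n, p0) * (1 + 1e-7)])

round(c(exact  = exact_test$p.value,
        manual = p_manual,
        approx = approx_test$p.value), 3)
```

The first two values should coincide, whereas the third comes from the approximation and differs slightly.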

Using an approximation, the proportion test gives slightly different results compared to the binomial test (hint: look at the p-value and confidence interval). Now we may perform the same procedure with a larger sample size, where we state the problem as follows:

$$\text{Let } X \sim B(n, p)$$ $$\texttt{Test the probability of having: } P(X=60 \ |\ 100, 0.5)$$

\begin{align} H_0 &: p = 0.5 \\ H_a &: p \neq 0.5 \end{align}

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.6 | 60 | 0.057 | 100 | 0.497 | 0.697 | Exact binomial test | two.sided |

(#tab:prop.2)Binomial test

| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.6 | 3.61 | 0.057 | 1 | 0.497 | 0.695 | 1-sample proportions test with continuity correction | two.sided |

(#tab:prop.2)Proportion test

With a larger sample size, the proportion test gives a better approximation of the binomial test. Notice how the p-value and confidence interval produced by the proportion test get closer to those of the binomial test, as compared to our previous test. In fact, as \(n \to \infty\), we can expect an ever closer approximation to the exact test. The perk of using an approximation is its comparatively low computational cost, at the price of needing a reasonably large sample.
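To see the gap between both tests shrink, we can put the two scenarios side by side. The helper `compare_tests()` below is a hypothetical name introduced for illustration only; it simply wraps the base R functions `binom.test()` and `prop.test()`.

```r
# Compare exact and approximate p-values for the same observed proportion (0.6)
# at two sample sizes; H0: p = 0.5 in both cases
compare_tests <- function(x, n, p0 = 0.5) {
  c(n      = n,
    exact  = binom.test(x, n, p = p0)$p.value,
    approx = prop.test(x, n, p = p0)$p.value)
}

res <- rbind(compare_tests(6, 10), compare_tests(60, 100))
res <- cbind(res, abs_diff = abs(res[, "exact"] - res[, "approx"]))
round(res, 4)
```

The `abs_diff` column should be smaller in the second row, in line with the tables shown earlier.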

However, we are often curious about more than one variable, i.e. a proportional difference between multiple groups. In such cases, neither the binomial nor the one-sample proportion test can help us! To frame the problem, we first tabulate the observations in a contingency table. Then we may apply a more appropriate test, e.g. Fisher’s exact test or its approximation, Pearson’s Chi-square.

A contingency table is a table with \(m \times n\) cells, most commonly a \(2 \times 2\) table. Its rows and columns represent the two variables of interest, respectively. With arbitrary elements, we can draw a contingency table as follows:

|            | Outcome 1 | Outcome 2 |
|---|---|---|
| Exposure 1 | a | b |
| Exposure 2 | c | d |

To help us visualize a contingency table, suppose we are conducting market research in Jakarta, where we aim to see how people express their preferences in choosing chain store outlets. We categorized participants based on their place of residence, i.e. suburban or urban area. The mini-market chains of interest are Indomaret and Alfamart. We observed that 30 out of 50 respondents in the suburban area chose Indomaret, compared to 20 out of 50 respondents in the urban area.

|          | Indomaret | Alfamart |
|---|---|---|
| Suburban | 30 | 20 |
| Urban    | 20 | 30 |
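In R, such a table can be stored as a matrix. The sketch below assumes the object is called `survey`, the name used by the test calls later in this post; its construction is not shown in the original, so the exact form is our assumption.

```r
# Market-research counts as a 2 x 2 contingency table
# (rows: area of residence, columns: preferred chain)
survey <- matrix(c(30, 20,
                   20, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Residence = c("Suburban", "Urban"),
                                 Store     = c("Indomaret", "Alfamart")))

rowSums(survey)                  # respondents per area
colSums(survey)                  # respondents per chain
prop.table(survey, margin = 1)   # row-wise proportions: share of each chain per area
```

The row-wise proportions make the comparison explicit: 0.6 of suburban respondents chose Indomaret against 0.4 of urban respondents.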

As tests of proportional difference, both Fisher’s exact test and Pearson’s Chi-square can help us draw inferences from our market research data. In later sections we will see their limitations and use cases.

Fisher’s Exact Test

Fisher’s test provides an exact p-value calculation based on the hypergeometric distribution. From the previous lecture, we have learnt about the geometric distribution, which describes the number of independent Bernoulli trials needed to obtain the first success. The hypergeometric distribution is somewhat similar to the binomial distribution, except that its trials are not identical. In the binomial distribution, each trial is a Bernoulli trial with an identical probability of success, as when drawing with replacement. A hypergeometric distribution instead describes drawing without replacement, so each draw changes the probability of success in the next one (kindly recall probability concepts from the urn problem). In other words:

  • The binomial distribution gives the probability of having \(k\) successes within \(n\) trials with replacement
  • The geometric distribution gives the probability that the first success occurs on the \(n\)-th trial, with replacement
  • The hypergeometric distribution gives the probability of having \(k\) successes within \(n\) draws without replacement
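The three distributions can be contrasted with base R density functions. The numbers below (four trials, an urn with five white and five black balls) are made up purely for illustration.

```r
# Binomial: P(k = 2 successes in n = 4 independent trials, p = 0.5)
p_binom <- dbinom(2, size = 4, prob = 0.5)

# Geometric: P(first success on trial 3); dgeom() counts the failures BEFORE
# the first success, hence the 3 - 1
p_geom <- dgeom(3 - 1, prob = 0.5)

# Hypergeometric: urn with 5 white and 5 black balls, draw 4 without
# replacement, P(exactly 2 white)
p_hyper <- dhyper(2, m = 5, n = 5, k = 4)

c(binomial = p_binom, geometric = p_geom, hypergeometric = p_hyper)
```

Note how the binomial and hypergeometric results differ even though both start from a 50% chance of success: without replacement, each draw changes the composition of the urn.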

Looking at proportional differences in general, we can formulate our hypotheses as follows: \begin{align} H_0 &: p_1 = p_2 \\ H_a &: p_1 \neq p_2 \end{align} with \(p_i\) being the population proportion in group \(i\), estimated by the sample proportion \(\hat{p}_i\).

Since Fisher’s method calculates the exact p-value, we can solve the probability as a counting (combinatorial) problem. Therefore, we can calculate the probability of each possible table using any of the following equivalent equations:

\begin{align} P &= \frac{\binom{a + b}{a} \binom{c + d}{c}} {\binom{n}{a + c}} \tag{1} \\ \\ &= \frac{\binom{a + c}{a} \binom{b + d}{b}} {\binom{n}{a + b}} \tag{2} \\ \\ &= \frac{(a+b)!\ (c+d)!\ (a+c)!\ (b+d)!} {a!\ b!\ c!\ d!\ n!} \tag{3} \\ \\ \\ \\ n &= a + b + c + d \end{align}

Since a mathematical equation may not look too appealing to some people, we can also translate equation 1 into code, as follows:

fisher.eq <- function(abcd) { # abcd is a numeric vector of 4 cell counts: c(a, b, c, d)
    a <- abcd[1]; b <- abcd[2]; c <- abcd[3]; d <- abcd[4]
    # ways to fill row 1 * ways to fill row 2 / ways to pick the first column total
    choose(a+b, a) * choose(c+d, c) / choose(a+b+c+d, a+c)
}

Beware though: as Fisher’s test is an exact approach, we need to calculate the probability of the observed table and of all more extreme tables to obtain the p-value. In our case, we have \((a, b, c, d) = (30,20,20,30)\), so we need to address every table satisfying \((a,b,c,d) \in \{(30,20,20,30),(31,19,19,31),...,(50,0,0,50)\}\).

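| a | b | c | d | probability |
|---|---|---|---|---|
| 30 | 20 | 20 | 30 | 0.022 |
| 31 | 19 | 19 | 31 | 0.009 |
| 32 | 18 | 18 | 32 | 0.003 |
| 33 | 17 | 17 | 33 | 0.001 |
| 34 | 16 | 16 | 34 | 0.000 |
| 35 | 15 | 15 | 35 | 0.000 |
| 36 | 14 | 14 | 36 | 0.000 |
| 37 | 13 | 13 | 37 | 0.000 |
| 38 | 12 | 12 | 38 | 0.000 |
| 39 | 11 | 11 | 39 | 0.000 |
| 40 | 10 | 10 | 40 | 0.000 |
| 41 | 9 | 9 | 41 | 0.000 |
| 42 | 8 | 8 | 42 | 0.000 |
| 43 | 7 | 7 | 43 | 0.000 |
| 44 | 6 | 6 | 44 | 0.000 |
| 45 | 5 | 5 | 45 | 0.000 |
| 46 | 4 | 4 | 46 | 0.000 |
| 47 | 3 | 3 | 47 | 0.000 |
| 48 | 2 | 2 | 48 | 0.000 |
| 49 | 1 | 1 | 49 | 0.000 |
| 50 | 0 | 0 | 50 | 0.000 |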
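A table like the one above could be generated as follows. This is a sketch under assumptions: the data frame is named `tbl` to match the calls that follow, but its original construction is not shown, and `table_prob()` is a hypothetical helper implementing the hypergeometric probability of a single \(2 \times 2\) table.

```r
# Probability of one 2 x 2 table with fixed margins (hypergeometric)
table_prob <- function(a, b, c, d) {
  choose(a + b, a) * choose(c + d, c) / choose(a + b + c + d, a + c)
}

# Observed table (a = 30) and every more extreme table with the same margins:
# all row and column totals stay at 50, so b, c and d follow from a
a   <- 30:50
tbl <- data.frame(a = a, b = 50 - a, c = 50 - a, d = a)
tbl$probability <- with(tbl, table_prob(a, b, c, d))

round(head(tbl, 4), 3)
one_tailed <- sum(tbl$probability)   # one-tailed p-value
one_tailed
```

The same probabilities are available from base R through `dhyper(tbl$a, 50, 50, 50)`, which can serve as a cross-check.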

To calculate the p-value of a one-tailed test, we simply sum all the probabilities in the table:

sum(tbl$probability)
| [1] 0.0357

Because the row and column totals are all equal here, the distribution is symmetric, so we can get the two-tailed p-value by multiplying the one-tailed p-value by \(2\):

sum(tbl$probability) * 2
| [1] 0.0713

We can compare our calculation with the built-in fisher.test function in R, which results in:

fisher.test(survey, alternative="greater")$p.value
| [1] 0.0357
fisher.test(survey, alternative="two.sided")$p.value
| [1] 0.0713

Chi-square Test of Independence

The previous example demonstrated how Fisher’s test computes an exact p-value given a particular condition. We can imagine that, with a larger sample size, the computation gets more complicated. To provide an approximation, we may use other measures, such as the Chi-square test of independence or the G-test. As Chi-square is more ubiquitous, we will limit our discussion to this approach. Please be advised though that different variants of the Chi-square test exist, and we may choose among them based on how we assume the variables associate with one another.

Our demonstration of Fisher’s test depicts how arduous the computation can be. Pearson’s Chi-square, as one method of approximation, gives a close estimate to Fisher’s test, especially as the sample size grows. With Fisher’s test, we often limit our inference to a \(2 \times 2\) contingency table. Pearson’s Chi-square provides a more general construct that applies to any \(m \times n\) contingency table. As its name suggests, the statistic computed using Pearson’s Chi-square follows a Chi-square distribution, where the degree of freedom \(k = (m-1)(n-1)\) depends on the number of classes in our variables. Approximating the p-value requires computing the Chi-square value, which we define as:

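\begin{align} \chi^2 &= \displaystyle \sum_{i, j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\ E_{ij} &= \frac{O_{i\cdot} \cdot O_{\cdot j}}{N} \end{align}

\(O_{ij}:\) Observed outcome in cell \((i, j)\)
\(E_{ij}:\) Expected outcome in cell \((i, j)\) under \(H_0\)
\(i, j:\) Row and column indices of the contingency table
\(O_{i\cdot},\ O_{\cdot j}:\) Total of row \(i\) and total of column \(j\)
\(N:\) Grand total of all cells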

Knowing the equation, we may compute the Chi-square value using the previously described contingency table:

|            | Outcome 1 | Outcome 2 |
|---|---|---|
| Exposure 1 | a | b |
| Exposure 2 | c | d |

So we will have our expected outcome as:

$$E_{11} = \frac{(a + b) \cdot (a + c)}{a + b + c + d}$$

$$E_{12} = \frac{(a + b) \cdot (b + d)}{a + b + c + d}$$

$$E_{21} = \frac{(c + d) \cdot (a + c)}{a + b + c + d}$$

$$E_{22} = \frac{(c + d) \cdot (b + d)}{a + b + c + d}$$

Plugging in the values of \((a, b, c, d)\) from our survey:

|          | Indomaret | Alfamart |
|---|---|---|
| Suburban | 30 | 20 |
| Urban    | 20 | 30 |

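We can calculate the expected outcomes from the row totals \(O_{i\cdot}\), the column totals \(O_{\cdot j}\) and the grand total \(N = 100\):

\begin{align} E_{ij} &= \frac{O_{i\cdot} \cdot O_{\cdot j}}{N} = \frac{50 \cdot 50}{100} \\ \\ E_{11} &= 25 \\ E_{12} &= 25 \\ E_{21} &= 25 \\ E_{22} &= 25 \end{align}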

Then we can compute our Chi-square statistic:

\begin{align} \chi^2 &= \displaystyle \sum_{i, j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\ &= \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} + \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} \\ &= 4 \end{align}

Great! But it has yet to cough up the p-value. To obtain our p-value, we need the upper-tail probability of \(\chi^2=4\) under a Chi-square distribution with \(k=1\) degree of freedom.

1 - pchisq(4, df=1)
| [1] 0.0455

How does our finding compare to R?

chisq.test(survey, correct=FALSE)$p.value
| [1] 0.0455
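The whole computation can also be sketched end to end. The snippet below rebuilds the survey table (under the assumed name `survey`) and uses only base R; the remaining variable names are ours.

```r
survey <- matrix(c(30, 20,
                   20, 30), nrow = 2, byrow = TRUE)

# Expected counts: (row total x column total) / grand total, for every cell
expected <- outer(rowSums(survey), colSums(survey)) / sum(survey)

# Pearson's statistic and its degrees of freedom, (rows - 1) x (columns - 1)
chi_sq <- sum((survey - expected)^2 / expected)
df     <- (nrow(survey) - 1) * (ncol(survey) - 1)

# Upper-tail probability of the Chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
c(statistic = chi_sq, df = df, p.value = p_value)
```

Using `outer()` keeps the sketch valid for any \(m \times n\) table, not only the \(2 \times 2\) case worked out by hand.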

Test of Independence with Paired Samples

When conducting research, we are sometimes more interested in how the same individuals respond across different measurements, either by measuring at a different time or by using a different measurement scale. In other words, we have the same samples and variables, yet different measures, i.e. a paired sample. By design, Pearson’s Chi-square cannot detect changes happening over time, as it assumes independent observations and only depicts proportional differences in a given contingency table. As a solution, McNemar’s Chi-square provides a more appropriate p-value for how the proportion changes, because it respects the pairing of observations.

We will consider the following scenario: suppose we continue our market research by asking exactly the same subjects three months later. We expected no changes in their preferences of chain-store outlets. It turned out that, regardless of their area of residence, 25 people who previously preferred to go to Indomaret now shop at Alfamart. Meanwhile, 20 people who used to visit Alfamart now prefer Indomaret.

To capture changes over time, we can formulate our contingency table as follows, with rows showing the first survey and columns showing the survey three months later:

| First survey \ Three months later | Indomaret | Alfamart |
|---|---|---|
| Indomaret | 25 | 25 |
| Alfamart  | 20 | 30 |

Then, we will set our hypotheses as:

\begin{align} H_0 &: p_{t_0} = p_{t_1} \\ H_a &: p_{t_0} \neq p_{t_1} \end{align}

To calculate the Chi-square statistic, McNemar proposed the following equation, which only uses the discordant cells \(b\) and \(c\):

$$\chi^2 = \frac{(b-c)^2}{b+c}$$

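In R, we can obtain the p-value by issuing the command below. Note that mcnemar.test applies a continuity correction by default, i.e. it computes \(\chi^2 = \frac{(|b-c|-1)^2}{b+c}\), so the reported statistic is \(16/45 \approx 0.36\) (printed as 0.4) rather than \(25/45 \approx 0.56\):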

mcnemar.test(survey2)
| 
|   McNemar's Chi-squared test with continuity correction
| 
| data:  survey2
| McNemar's chi-squared = 0.4, df = 1, p-value = 0.6
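The statistic is simple enough to verify by hand. In this sketch, the way `survey2` is built is our assumption (only its name appears in the original), and the variable names are illustrative. Both the plain and the continuity-corrected versions are shown.

```r
# Paired preferences: rows = first survey, columns = three months later
survey2 <- matrix(c(25, 25,
                    20, 30),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Before = c("Indomaret", "Alfamart"),
                                  After  = c("Indomaret", "Alfamart")))

# Discordant pairs: respondents who switched chains
b  <- survey2["Indomaret", "Alfamart"]   # Indomaret -> Alfamart
cc <- survey2["Alfamart", "Indomaret"]   # Alfamart  -> Indomaret

chi_plain     <- (b - cc)^2 / (b + cc)            # McNemar's original statistic
chi_corrected <- (abs(b - cc) - 1)^2 / (b + cc)   # with continuity correction

p_plain     <- pchisq(chi_plain,     df = 1, lower.tail = FALSE)
p_corrected <- pchisq(chi_corrected, df = 1, lower.tail = FALSE)
round(c(chi_plain = chi_plain, p_plain = p_plain,
        chi_corrected = chi_corrected, p_corrected = p_corrected), 3)
```

The corrected pair should match `mcnemar.test(survey2)`, while the plain pair corresponds to `mcnemar.test(survey2, correct = FALSE)`.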

Applying Yates’ Correction

Upon reading the sections on exact tests and their approximations, we may have wondered why they provide different results. Does it depend only on the sample size, or rather, will we have similar results when \(n \to \infty\)? The answer is not quite straightforward, and this section will potentially add more confusion to the question. When conducting statistical inference, we ought to satisfy the assumptions of the particular test of our interest. In Fisher’s test, we need to have fixed marginal totals. Looking at our dummy contingency table, the marginal totals are the row and column sums, i.e. \(a+b\), \(c+d\), \(a+c\) and \(b+d\). It is not a stringent assumption per se, and I personally have not found strong evidence on how its violation may affect the robustness of the test. However, as a rule of thumb based on convention, we only apply Fisher’s test when we have a low sample count.

Now we can ask a different question: how small should our sample be before we apply Fisher’s test? Again, it is a rule of thumb. First we need to have our contingency table ready for Pearson’s Chi-square test. When we calculate the expected outcomes and any cell has an expected count < 5, Fisher’s is the more appropriate test to conduct. However, in many cases we may find our data unsuitable for Fisher’s exact test, yet we are not sure whether Pearson’s Chi-square will give a good approximation.
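The rule of thumb on expected counts can be sketched as a small check. The table `small` below is hypothetical data made up for illustration, and the threshold of 5 is the convention stated above rather than a strict requirement.

```r
# Hypothetical small 2 x 2 table, for illustration only
small <- matrix(c(3, 1,
                  1, 3), nrow = 2, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(small), colSums(small)) / sum(small)

# Rule of thumb: any expected count below 5 -> prefer Fisher's exact test
if (any(expected < 5)) {
  result <- fisher.test(small)
} else {
  result <- chisq.test(small)
}
result$method
```

Here every expected count equals 2, so the check falls back on Fisher’s exact test.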

In such a condition, we have Yates’ correction to give a better estimate. Yates’ method provides a better estimate and alleviates bias in a \(2 \times 2\) contingency table. With a larger contingency table, we often do not require Yates’ correction. If we recall Pearson’s Chi-square equation, we can apply Yates’ correction in the following fashion:

$$\chi^2 = \displaystyle \sum_{i, j} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}$$
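Applied to the market research table, the correction can be sketched as follows. The object name `survey` and the variable names are assumptions; `chisq.test()` with `correct = TRUE` (its default for \(2 \times 2\) tables) serves as a cross-check.

```r
survey <- matrix(c(30, 20,
                   20, 30), nrow = 2, byrow = TRUE)

expected <- outer(rowSums(survey), colSums(survey)) / sum(survey)

# Yates' correction: shrink every |O - E| by 0.5 before squaring
chi_yates <- sum((abs(survey - expected) - 0.5)^2 / expected)
p_yates   <- pchisq(chi_yates, df = 1, lower.tail = FALSE)

builtin <- chisq.test(survey, correct = TRUE)
round(c(manual = chi_yates, builtin = unname(builtin$statistic),
        p_manual = p_yates, p_builtin = builtin$p.value), 4)
```

The corrected p-value is larger than the uncorrected 0.0455 and sits much closer to the two-sided Fisher’s result of 0.0713 obtained earlier.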

Concluding Remarks

  • Large sample (> 10 in each cell) \(\to\) use an approximation
  • Small sample \(\to\) use an exact test
  • \(2 \times 2\) contingency table with an approximation \(\to\) apply Yates’ correction
  • Small sample with an \(m \times n\) contingency table \(\to\) split the table or use simulation