Hypothesis Test: Proportional Difference

.bg-main4.column[.vmiddle.content[
.amber[Aly Lamuri]  
Indonesia Medical Education and Research Institute
]]

---

.bg-main1.column[.vmiddle.content[
- .amber[Proportional difference]
- Exact test
- Approximation
- Paired sample
- Applying Yates' correction
]]

---

# Proportional difference

.font2[
- Concept recall: proportion in population and sample?
- So far, we have relied on the binomial test
- It is an *exact* test of .amber[one] proportion
- Another test to consider: .amber[proportion test]
]

---

???

- An exact test: the p-value is computed exactly from the binomial distribution
- It is computationally demanding
- It becomes hard to conduct with a large sample size
- In such cases, we may prefer an approximation

---

# Example?

`$$\text{Let } X \sim B(n, p)$$`
`$$\text{Test whether observing } X = 6 \text{ out of } n = 10 \text{ is consistent with } p = 0.5$$`

`\begin{align}
H_0 &: p = 0.5 \\
H_a &: p \neq 0.5
\end{align}`

---

.bg-white.content[
<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'>
<caption>Binomial test</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> parameter </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
   <th style="text-align:left;"> method </th>
   <th style="text-align:left;"> alternative </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.6 </td>
   <td style="text-align:right;"> 6 </td>
   <td style="text-align:right;"> 0.754 </td>
   <td style="text-align:right;"> 10 </td>
   <td style="text-align:right;"> 0.262 </td>
   <td style="text-align:right;"> 0.878 </td>
   <td style="text-align:left;"> Exact binomial test </td>
   <td style="text-align:left;"> two.sided </td>
  </tr>
</tbody>
</table>

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'>
<caption>Proportion test</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> parameter </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
   <th style="text-align:left;"> method </th>
   <th style="text-align:left;"> alternative </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.6 </td>
   <td style="text-align:right;"> 0.1 </td>
   <td style="text-align:right;"> 0.752 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.274 </td>
   <td style="text-align:right;"> 0.863 </td>
   <td style="text-align:left;"> 1-sample proportions test with continuity correction </td>
   <td style="text-align:left;"> two.sided </td>
  </tr>
</tbody>
</table>
]
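
The two tables above can be reproduced in base `R`. This is a minimal sketch: the slides do not show the generating code, so the use of `binom.test()` and `prop.test()` here, and the names `exact_fit` and `approx_fit`, are assumptions about how the numbers were obtained.

```r
# Observed 6 successes in 10 trials; null hypothesis p = 0.5
exact_fit  <- binom.test(x = 6, n = 10, p = 0.5)  # exact binomial test
approx_fit <- prop.test(x = 6, n = 10, p = 0.5)   # chi-square approximation,
                                                  # continuity-corrected by default

# Compare the two p-values side by side
round(c(exact = exact_fit$p.value, approximate = approx_fit$p.value), 3)
```

With only ten observations the two p-values are close but not identical, which is the point of the comparison.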

---

# Another example?

`$$\text{Let } X \sim B(n, p)$$`
`$$\text{Test whether observing } X = 60 \text{ out of } n = 100 \text{ is consistent with } p = 0.5$$`

`\begin{align}
H_0 &: p = 0.5 \\
H_a &: p \neq 0.5
\end{align}`

---

.bg-white.content[
<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'>
<caption>Binomial test</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> parameter </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
   <th style="text-align:left;"> method </th>
   <th style="text-align:left;"> alternative </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.6 </td>
   <td style="text-align:right;"> 60 </td>
   <td style="text-align:right;"> 0.057 </td>
   <td style="text-align:right;"> 100 </td>
   <td style="text-align:right;"> 0.497 </td>
   <td style="text-align:right;"> 0.697 </td>
   <td style="text-align:left;"> Exact binomial test </td>
   <td style="text-align:left;"> two.sided </td>
  </tr>
</tbody>
</table>

<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'>
<caption>Proportion test</caption>
 <thead>
  <tr>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:right;"> parameter </th>
   <th style="text-align:right;"> conf.low </th>
   <th style="text-align:right;"> conf.high </th>
   <th style="text-align:left;"> method </th>
   <th style="text-align:left;"> alternative </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:right;"> 0.6 </td>
   <td style="text-align:right;"> 3.61 </td>
   <td style="text-align:right;"> 0.057 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.497 </td>
   <td style="text-align:right;"> 0.695 </td>
   <td style="text-align:left;"> 1-sample proportions test with continuity correction </td>
   <td style="text-align:left;"> two.sided </td>
  </tr>
</tbody>
</table>
]
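
The statistic in the proportion test can also be computed by hand. The sketch below assumes the usual continuity-corrected chi-square statistic for one proportion, `$(|x - np_0| - 0.5)^2 / (np_0(1 - p_0))$`; the variable names are illustrative.

```r
x <- 60; n <- 100; p0 <- 0.5

# Continuity-corrected chi-square statistic for one proportion
chi2 <- (abs(x - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))
p_approx <- pchisq(chi2, df = 1, lower.tail = FALSE)

# Exact two-sided p-value from the binomial test, for comparison
p_exact <- binom.test(x, n, p0)$p.value

round(c(statistic = chi2, approximate = p_approx, exact = p_exact), 3)
```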

---

# What do we learn?

.font2[
- With a low sample size, an exact test is more .amber[appropriate]
- As the sample size grows (`$n \to \infty$`), the approximation converges to the exact result
- An approximation requires less computational power
]

---

# But...

.font2[
- Often we are interested in more than one variable
- We may want to see .amber[proportional differences] across multiple groups
- In such cases, neither the binomial test nor the one-sample proportion test can help us!
]

# What can we do?

.font2[
- Visualize our problem as a .amber[contingency table]
- Use a more appropriate statistical test:
  - Fisher's exact test
  - Pearson's Chi-square
]

???

- Remember the last time we talked about Chi-square distribution?
- We'll use a lot of them in later sections :)

---

# Contingency table

.font2[ 
- A table outlining our problem :)
- Each cell holds the .amber[count] of observations for one combination of the variables of interest
]

---

# What does it look like?

???

Fun fact: The contingency table is also called a cross tabulation

.bg-white[
<br>
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:left;"> Outcome 1 </th>
   <th style="text-align:left;"> Outcome 2 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Exposure 1 </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> b </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Exposure 2 </td>
   <td style="text-align:left;"> c </td>
   <td style="text-align:left;"> d </td>
  </tr>
</tbody>
</table>
<br>
]

---

# Example?

.font2[
We are conducting market research in Jakarta to see how people choose between
chain store outlets. We categorized participants by their place of residence,
i.e. .amber[suburban] or .amber[urban] areas. The mini-market chains of interest
are .amber[Indomaret] and .amber[Alfamart]. We observed that .pink[30 out of 50]
respondents in suburban areas chose Indomaret, compared to .pink[20 out of 50]
respondents in urban areas.
]

---

.bg-white[
<br>
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Indomaret </th>
   <th style="text-align:right;"> Alfamart </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Suburban </td>
   <td style="text-align:right;"> 30 </td>
   <td style="text-align:right;"> 20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Urban </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 30 </td>
  </tr>
</tbody>
</table>
<br>
]
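
Later slides pass an object called `survey` to the test functions without showing how it is built. A minimal sketch, assuming `survey` is a plain 2 × 2 matrix of the counts above (the dimension names `residence` and `store` are illustrative):

```r
# Rows: place of residence; columns: preferred chain store
survey <- matrix(
  c(30, 20,
    20, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    residence = c("Suburban", "Urban"),
    store     = c("Indomaret", "Alfamart")
  )
)

survey
prop.table(survey, margin = 1)  # row-wise proportions per area of residence
```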

???

- We can use either an exact test or an approximation
- Fisher's exact test
- Pearson's Chi-square (approximation)
- We will see the limitations and use cases of each approach

---

.bg-main1.column[.vmiddle.content[
- Proportional difference
- .amber[Exact test]
- Approximation
- Paired sample
- Applying Yates' correction
]]

---

# Fisher's exact test

.font2[
- Follows a hypergeometric distribution
- Concept recall: what is a geometric distribution?
- Extending previous concepts: what is a hypergeometric distribution?
]
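
The three distributions can be compared with the density functions in base `R`. This is a small sketch with a made-up urn of 2 white and 2 black balls, chosen only to keep the numbers easy to verify by hand.

```r
# Geometric: probability of 2 failures before the first success, with p = 0.5
p_geom <- dgeom(2, prob = 0.5)

# Urn with 2 white and 2 black balls, drawing 2 balls:
# probability of getting exactly 1 white ball
p_binom <- dbinom(1, size = 2, prob = 0.5)  # with replacement
p_hyper <- dhyper(1, m = 2, n = 2, k = 2)   # without replacement

c(geometric = p_geom, binomial = p_binom, hypergeometric = p_hyper)
```

Drawing without replacement changes the probability from 1/2 to 2/3, which is the difference the hypergeometric distribution accounts for.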

???

- Geometric distribution: the number of trials needed to get the first success,
  with the same success probability in every trial (*with* replacement)
- Hypergeometric distribution: the number of successes `$k$` in `$n$` draws
  *without* replacement
- We can view the hypergeometric distribution as an extension of the binomial
  distribution
- In the binomial distribution, the success probability is *identical* in every
  trial (sampling with replacement)

# How do we formulate the hypothesis?

.font2[
`\begin{align}
H_0 &: p_1 = p_2 \\
H_a &: p_1 \neq p_2
\end{align}`
]

???

- `$p_i:$` Proportion in group `$i$`

---

# How do we calculate the probability?

`\begin{align}
P &= \frac{\binom{a + b}{a} \binom{c + d}{c}} {\binom{n}{a + c}} \tag{1} \\
  \\
  &= \frac{\binom{a + c}{a} \binom{b + d}{b}} {\binom{n}{a + b}} \tag{2} \\
  \\
  &= \frac{(a+b)!\ (c+d)!\ (a+c)!\ (b+d)!} {a!\ b!\ c!\ d!\ n!} \tag{3} \\
  \\
  \\
  \\
n &= a + b + c + d
\end{align}`
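
The factorial form (3) can be checked numerically. The sketch below uses a hypothetical helper `fisher_prob()` that works on log-factorials to avoid overflow, and compares it with the built-in hypergeometric density `dhyper()`.

```r
# Probability of one 2x2 table with fixed margins, using factorial form (3).
# Log-factorials keep the computation stable for large counts.
fisher_prob <- function(a, b, c, d) {
  n <- a + b + c + d
  log_p <- lfactorial(a + b) + lfactorial(c + d) +
           lfactorial(a + c) + lfactorial(b + d) -
           lfactorial(a) - lfactorial(b) - lfactorial(c) - lfactorial(d) -
           lfactorial(n)
  exp(log_p)
}

fisher_prob(30, 20, 20, 30)

# Same probability from the hypergeometric density:
# x = a, m = a + b, n = c + d, k = a + c
dhyper(30, m = 50, n = 50, k = 50)
```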

???

You may choose any of those equations

---

# In code, please?

```r
fisher.eq <- function(abcd) { # abcd is a numeric vector c(a, b, c, d)
	a <- abcd[[1]]; b <- abcd[[2]]; c <- abcd[[3]]; d <- abcd[[4]]
	# Hypergeometric probability of this table, given fixed row and column totals
	choose(a+b, a) * choose(c+d, c) / choose(a+b+c+d, a+c)
}
```

# Let's solve our case!

.bg-white[
<br>
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Indomaret </th>
   <th style="text-align:right;"> Alfamart </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Suburban </td>
   <td style="text-align:right;"> 30 </td>
   <td style="text-align:right;"> 20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Urban </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 30 </td>
  </tr>
</tbody>
</table>
<br>
]

???

- Fisher's is an exact test
- Which means we need to take ALL outcomes at least as extreme as the observed one into account

---

# Fisher's equation solution

<div id="htmlwidget-d5a1c8a2ff9de598ae85" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-d5a1c8a2ff9de598ae85">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21"],[30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50],[20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0],[20,19,18,17,16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0],[30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50],[0.0220153934585849,0.00916353525851611,0.00323050412922296,0.000961141724396915,0.000240285431099229,5.02147513154306e-05,8.71783877004004e-06,1.24813469607586e-06,1.46076706119681e-07,1.38297473249402e-08,1.04587464144861e-09,6.22174087714816e-11,2.85692183134354e-12,9.88875052493168e-14,2.50283458533911e-15,4.44948370726954e-17,5.25695144998764e-19,3.80766062470811e-21,1.48736743152661e-23,2.47791325535461e-26,9.91165302141845e-30]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>a<\/th>\n      <th>b<\/th>\n      <th>c<\/th>\n      <th>d<\/th>\n      <th>probability<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-right","targets":[1,2,3,4,5]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

<br>
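
The p-value slide sums a column `tbl$probability` whose construction is not shown. A minimal sketch of how such a table could be built, assuming the margins stay fixed at 50 and `a` runs from the observed value up to its maximum:

```r
# Observed table: a = 30, b = 20, c = 20, d = 30 (all margins equal 50)
a <- 30:50  # the observed value and every more extreme one
tbl <- data.frame(a = a, b = 50 - a, c = 50 - a, d = a)

# Hypergeometric probability of each table, given the fixed margins
tbl$probability <- with(tbl, dhyper(a, m = a + b, n = c + d, k = a + c))

head(tbl, 3)
```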

---

# Calculate the p-value

## One-tailed test

```r
sum(tbl$probability)
```

```
## [1] 0.0357
```

## Two-tailed test

```r
sum(tbl$probability) * 2
```

```
## [1] 0.0713
```
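
Doubling the one-tailed p-value works here because the table has symmetric margins. A more general approach sums the probabilities of every table that is no more likely than the observed one; as far as I know this is also the rule `fisher.test()` applies to 2 × 2 tables. The sketch below rebuilds the survey counts so that it runs on its own.

```r
survey <- matrix(c(30, 20, 20, 30), nrow = 2, byrow = TRUE)

# All values the top-left cell can take, given the fixed margins
row1 <- sum(survey[1, ]); row2 <- sum(survey[2, ]); col1 <- sum(survey[, 1])
a_all <- max(0, col1 - row2):min(row1, col1)
prob  <- dhyper(a_all, m = row1, n = row2, k = col1)

# Two-sided p-value: add up every table no more likely than the observed one
p_obs <- dhyper(survey[1, 1], m = row1, n = row2, k = col1)
p_two <- sum(prob[prob <= p_obs * (1 + 1e-7)])  # tolerance for floating-point ties

c(manual = p_two, builtin = fisher.test(survey)$p.value)
```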

---

# Let `R` do the hard stuff for us

---

## One-tailed test

```r
fisher.test(survey, alternative="greater")
```

```
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  survey
## p-value = 0.04
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  1.06  Inf
## sample estimates:
## odds ratio 
##       2.23
```

---

## Two-tailed test

```r
fisher.test(survey, alternative="two.sided")
```

```
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  survey
## p-value = 0.07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.94 5.41
## sample estimates:
## odds ratio 
##       2.23
```

---

# A wild .pink[homework] has appeared!

### Perform Fisher's exact test on the following scenario:

- `$a:$` 40
- `$b:$` 15
- `$c:$` 15
- `$d:$` 20

### Task:

- Find the p-value for a .amber[one-tailed] test
- Find the p-value for a .amber[two-tailed] test

### Rules:

- Apply Fisher's equation to solve the problem
- You may use a calculator or write your own code
- Show me the table of your calculations
- .pink[Do not] use a pre-existing package! (`numpy` is allowed though)

---

# Can you get a similar solution when computing by hand?

```r
c(40, 15, 15, 20) %>% matrix(nrow=2) %>% fisher.test()
```

```
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  .
## p-value = 0.007
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.33 9.58
## sample estimates:
## odds ratio 
##        3.5
```

---

.bg-main1.column[.vmiddle.content[
- Proportional difference
- Exact test
- .amber[Approximation]
- Paired sample
- Applying Yates' correction
]]

---

# Approximating Fisher's solution

.font2[
- There are several approaches we may follow
- Pearson's Chi-square and the G-test are popular ones
- We will only look into Chi-square
]

???

- Different methods of Chi-square computation exist
- We have *goodness of fit* and *test of independence*
- Choose your method wisely

---

# Why an approximation?

.font2[
- As we have seen, an exact calculation is .amber[arduous]
- A larger sample size requires more .amber[computational power]
- It is most often applied to a `$2 \times 2$` contingency table
- An approximation is more flexible
- It can even handle an `$m \times n$` contingency table
]

---

# Chi-square test of independence

- Statistical computation follows a Chi-square distribution
- Degrees of freedom `$k = (r - 1)(c - 1)$` depend on the number of classes `$r$` and `$c$` in the two variables

# Example

- Outcome 1: 2 classes, outcome 2: 2 classes `$\to k=1$`
- Outcome 1: 2 classes, outcome 2: 3 classes `$\to k=2$`
- Outcome 1: 3 classes, outcome 2: 3 classes `$\to k=4$`
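
The three examples follow the rule `$k = (r - 1)(c - 1)$` for a table with `$r$` rows and `$c$` columns. A tiny sketch, where `chisq_df()` is a hypothetical helper name:

```r
# Degrees of freedom for an r x c contingency table
chisq_df <- function(n_rows, n_cols) (n_rows - 1) * (n_cols - 1)

chisq_df(2, 2)  # 2 x 2 table
chisq_df(2, 3)  # 2 x 3 table
chisq_df(3, 3)  # 3 x 3 table
```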

---

# Calculating Chi-square

.font2[
`\begin{align}
\chi^2 &= \displaystyle \sum_{i, j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\
E_{ij} &= \frac{\sum_{l} O_{il} \cdot \sum_{k} O_{kj}}{\sum_{k, l} O_{kl}}
\end{align}`

`$O:$` Observed outcome  
`$E:$` Expected outcome  
`$i, j:$` Row and column indices of the contingency table
]

---

.bg-white.column[.vmiddle.content[
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:left;"> Outcome 1 </th>
   <th style="text-align:left;"> Outcome 2 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Exposure 1 </td>
   <td style="text-align:left;"> a </td>
   <td style="text-align:left;"> b </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Exposure 2 </td>
   <td style="text-align:left;"> c </td>
   <td style="text-align:left;"> d </td>
  </tr>
</tbody>
</table>
]]


---

`$$E_{11} = \frac{(a + b) \cdot (a + c)}{a + b + c + d}$$`

---

`$$E_{12} = \frac{(a + b) \cdot (b + d)}{a + b + c + d}$$`

---

`$$E_{21} = \frac{(c + d) \cdot (a + c)}{a + b + c + d}$$`

---

`$$E_{22} = \frac{(c + d) \cdot (b + d)}{a + b + c + d}$$`

---

.bg-white.column[.vmiddle.content[
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Indomaret </th>
   <th style="text-align:right;"> Alfamart </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Suburban </td>
   <td style="text-align:right;"> 30 </td>
   <td style="text-align:right;"> 20 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Urban </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 30 </td>
  </tr>
</tbody>
</table>
]]

---

## Calculating Expected outcome

`\begin{align}
E_{ij} &= \frac{\sum_{l} O_{il} \cdot \sum_{k} O_{kj}}{\sum_{k, l} O_{kl}} \\
\\
E_{11} &= 25 \\
E_{12} &= 25 \\
E_{21} &= 25 \\
E_{22} &= 25
\end{align}`
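
The expected counts are the row total times the column total, divided by the grand total. A minimal sketch with the survey counts entered by hand (the names `observed` and `expected` are illustrative):

```r
observed <- matrix(c(30, 20, 20, 30), nrow = 2, byrow = TRUE,
                   dimnames = list(c("Suburban", "Urban"),
                                   c("Indomaret", "Alfamart")))

# Expected count per cell: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected
```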

---

# Calculating the `$\chi^2$` statistic

`\begin{align}
\chi^2 &= \displaystyle \sum_{i, j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\
       &= \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} + \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} \\
       &= 4
\end{align}`
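
The same arithmetic in `R`, as a sketch without continuity correction; `dof` is an illustrative name for the degrees of freedom.

```r
observed <- matrix(c(30, 20, 20, 30), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's chi-square statistic, without continuity correction
chi2 <- sum((observed - expected)^2 / expected)
dof  <- (nrow(observed) - 1) * (ncol(observed) - 1)

c(chi2 = chi2, dof = dof)
```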

---

# Determining p-value

```r
1 - pchisq(4, df=1)
```

```
## [1] 0.0455
```

---

# Built-in function in `R`

```r
chisq.test(survey, correct=FALSE)$p.value
```

```
## [1] 0.0455
```

---

.bg-main1.column[.vmiddle.content[
- Proportional difference
- Exact test
- Approximation
- .amber[Paired sample]
- Applying Yates' correction
]]

---

# Paired samples in the contingency table

???

- When we do a longitudinal study
- We have the same sample, but measured at different times
- We need to take into account differences occurring over time
- Pearson's Chi-square cannot address this issue
- Solution: McNemar's Chi-square

# Example?

.font2[
Suppose we continue our market research by asking .amber[**exactly the same**]
subjects *three months* later. We expected no changes in their preferences of
chain-store outlets. It turned out that, regardless of their area of residence,
.pink[25 people] who previously preferred Indomaret now shop at Alfamart.
Meanwhile, .pink[20 people] who used to visit Alfamart now prefer Indomaret.
]

---

<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> Indomaret </th>
   <th style="text-align:right;"> Alfamart </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Indomaret </td>
   <td style="text-align:right;"> 25 </td>
   <td style="text-align:right;"> 25 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Alfamart </td>
   <td style="text-align:right;"> 20 </td>
   <td style="text-align:right;"> 30 </td>
  </tr>
</tbody>
</table>

<br>

.font2[
`\begin{align}
H_0 &: p_{t_0} = p_{t_1} \\
H_a &: p_{t_0} \neq p_{t_1}
\end{align}`
]

---

# McNemar's Chi-square

```r
mcnemar.test(survey2)
```

```
## 
## 	McNemar's Chi-squared test with continuity correction
## 
## data:  survey2
## McNemar's chi-squared = 0.4, df = 1, p-value = 0.6
```
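
The statistic above depends only on the subjects who switched. A sketch that rebuilds `survey2` from the table (its construction is not shown in the slides, so the layout with "before" in rows and "after" in columns is an assumption) and computes the continuity-corrected statistic by hand:

```r
survey2 <- matrix(c(25, 25,
                    20, 30),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(before = c("Indomaret", "Alfamart"),
                                  after  = c("Indomaret", "Alfamart")))

# Discordant pairs: subjects whose preference changed
n_to_alfa <- survey2["Indomaret", "Alfamart"]  # switched to Alfamart
n_to_indo <- survey2["Alfamart", "Indomaret"]  # switched to Indomaret

# McNemar's chi-square with continuity correction
chi2 <- (abs(n_to_alfa - n_to_indo) - 1)^2 / (n_to_alfa + n_to_indo)
p    <- pchisq(chi2, df = 1, lower.tail = FALSE)

round(c(chi2 = chi2, p.value = p), 3)
```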

---

.bg-main1.column[.vmiddle.content[
- Proportional difference
- Exact test
- Approximation
- Paired sample
- .amber[Applying Yates' correction]
]]

---

# Yates' correction

.font2[
- Applies only to approximation tests
- Alleviates bias in a `$2 \times 2$` contingency table
- Especially useful when cell counts are low (< 10)
- With extremely low cell counts (< 5), use an exact test instead
]
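
Yates' correction subtracts 0.5 from each absolute difference between observed and expected counts before squaring. A minimal sketch on the survey counts, entered by hand so that it runs on its own:

```r
observed <- matrix(c(30, 20, 20, 30), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' correction: shrink each |O - E| by 0.5 before squaring
chi2_yates <- sum((abs(observed - expected) - 0.5)^2 / expected)
p_yates    <- pchisq(chi2_yates, df = 1, lower.tail = FALSE)

round(c(chi2 = chi2_yates, p.value = p_yates), 4)

# chisq.test() applies this correction to 2x2 tables by default
chisq.test(observed)$statistic
```

The corrected statistic is smaller than the uncorrected value of 4, so the corrected p-value is larger than 0.0455 and moves toward the two-tailed result of Fisher's exact test.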

---

# Lesson learnt

.font2[
- Large counts (> 10 in each cell) `$\to$` use approximation
- Low counts `$\to$` use an exact test
- `$2 \times 2$` contingency with approximation `$\to$` apply Yates' correction
- Low counts with an `$m \times n$` contingency table `$\to$` split the table or run a simulation
]
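
One way to "do simulation" in `R` is the Monte Carlo option of `chisq.test()`. The sketch below uses a made-up 3 × 3 table with small counts; the numbers are purely illustrative and not from the survey.

```r
set.seed(42)  # make the Monte Carlo result reproducible

# Hypothetical 3 x 3 table with several small cell counts
small_tbl <- matrix(c(3, 1, 4,
                      2, 6, 1,
                      5, 2, 3),
                    nrow = 3, byrow = TRUE)

# Monte Carlo p-value instead of relying on the chi-square approximation
sim <- chisq.test(small_tbl, simulate.p.value = TRUE, B = 2000)
sim$p.value
```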

---

.amber.font5[Query?]