class: bg-main1 hide-slide-number split-70
count: false

.column[.right.vmiddle.content[
.font3[Hypothesis Test: .amber[Proportional Difference]]
]]
.bg-main4.column[.vmiddle.content[
.amber[Aly Lamuri]
Indonesia Medical Education and Research Institute
]]

---
name: overview
layout: true
class: bg-main4 middle split-30 hide-slide-number

.column[.vmiddle.right.content[
.amber.font3[Overview]
]]

---
template: overview
count: false

.bg-main1.column[.vmiddle.content[
- .amber[Proportional difference]
- Exact test
- Approximation
- Paired sample
- Applying Yates' correction
]]

---
layout: true
class: bg-main3

# Proportional difference

.font2[
- Concept recall: proportion in population and sample?
- So far, we relied on the binomial test
- It is an *exact* test of .amber[one] proportion
- Another test to consider: .amber[proportion test]
]

---

???

- An exact test: the p-value is computed exactly from the binomial distribution
- It is computationally demanding
- Hard to conduct with a large sample size
- In such cases, we may want to use an approximation

---
count: false

# Example?
`$$\text{Let } X \sim B(n, p)$$`

`$$\texttt{Test the probability of having: } P(X=6 \ |\ 10, 0.5)$$`

--

`\begin{align} H_0 &: p = 0.5 \\ H_a &: p \neq 0.5 \end{align}`

---
count: false

.bg-white.content[
<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <caption>Binomial test</caption> <thead> <tr> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> parameter </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> <th style="text-align:left;"> method </th> <th style="text-align:left;"> alternative </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 0.754 </td> <td style="text-align:right;"> 10 </td> <td style="text-align:right;"> 0.262 </td> <td style="text-align:right;"> 0.878 </td> <td style="text-align:left;"> Exact binomial test </td> <td style="text-align:left;"> two.sided </td> </tr> </tbody> </table>
<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <caption>Proportion test</caption> <thead> <tr> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> parameter </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> <th style="text-align:left;"> method </th> <th style="text-align:left;"> alternative </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 0.1 </td> <td style="text-align:right;"> 0.752 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.274 </td> <td style="text-align:right;"> 0.863 </td>
<td style="text-align:left;"> 1-sample proportions test with continuity correction </td> <td style="text-align:left;"> two.sided </td> </tr> </tbody> </table>
]

---
count: false

# Another example?

`$$\text{Let } X \sim B(n, p)$$`

`$$\texttt{Test the probability of having: } P(X=60 \ |\ 100, 0.5)$$`

--

`\begin{align} H_0 &: p = 0.5 \\ H_a &: p \neq 0.5 \end{align}`

---
count: false

.bg-white.content[
<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <caption>Binomial test</caption> <thead> <tr> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> parameter </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> <th style="text-align:left;"> method </th> <th style="text-align:left;"> alternative </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.6 </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 0.057 </td> <td style="text-align:right;"> 100 </td> <td style="text-align:right;"> 0.497 </td> <td style="text-align:right;"> 0.697 </td> <td style="text-align:left;"> Exact binomial test </td> <td style="text-align:left;"> two.sided </td> </tr> </tbody> </table>
<table class=" lightable-paper" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; margin-left: auto; margin-right: auto;'> <caption>Proportion test</caption> <thead> <tr> <th style="text-align:right;"> estimate </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:right;"> parameter </th> <th style="text-align:right;"> conf.low </th> <th style="text-align:right;"> conf.high </th> <th style="text-align:left;"> method </th> <th style="text-align:left;"> alternative </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.6 </td> <td
style="text-align:right;"> 3.61 </td> <td style="text-align:right;"> 0.057 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.497 </td> <td style="text-align:right;"> 0.695 </td> <td style="text-align:left;"> 1-sample proportions test with continuity correction </td> <td style="text-align:left;"> two.sided </td> </tr> </tbody> </table>
]

---
count: false

# What do we learn?

.font2[
- With a small sample size, an exact test is more .amber[appropriate]
- As the sample size grows, `\(n \to \infty\)`, the approximation gets closer to the exact result
- An approximation requires less computational power
]

---
layout: false
class: bg-main3

# But...

.font2[
- Often we are more interested in multiple variables
- We may want to see .amber[proportional differences] across multiple groups
- In such cases, neither the binomial test nor the proportion test can help us!
]

--

# What can we do?

.font2[
- Visualize our problem as a .amber[contingency table]
- Use a more appropriate statistical test:
  - Fisher's exact test
  - Pearson's Chi-square
]

???

- Remember the last time we talked about the Chi-square distribution?
- We'll use it a lot in later sections :)

---
layout: true
class: bg-main3

# Contingency table

.font2[
- A table outlining our problem :)
- Each cell holds the .amber[count] of observations for one combination of the variables of interest
]

---

# What does it look like?

???
Fun fact: The contingency table is also called a cross tabulation

--

.bg-white[
<br>
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Outcome 1 </th> <th style="text-align:left;"> Outcome 2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Exposure 1 </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> b </td> </tr> <tr> <td style="text-align:left;"> Exposure 2 </td> <td style="text-align:left;"> c </td> <td style="text-align:left;"> d </td> </tr> </tbody> </table>
<br>
]

---
count: false

# Example?

--

.font2[
We are conducting market research in Jakarta to see which chain-store outlets people prefer. We categorized participants by their place of residence, i.e. .amber[suburban] or .amber[urban] area. The mini-market chains of interest are .amber[Indomaret] and .amber[Alfamart]. We observed that .pink[30 out of 50] respondents in the suburban area chose Indomaret, compared to .pink[20 out of 50] respondents in the urban area.
]

---
count: false

.bg-white[
<br>
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Indomaret </th> <th style="text-align:right;"> Alfamart </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Suburban </td> <td style="text-align:right;"> 30 </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:left;"> Urban </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 30 </td> </tr> </tbody> </table>
<br>
]

--

.font2[How do we test for differences?]

???
- We can do an exact test or an approximation
  - Fisher's exact test
  - Pearson's Chi-square (Approximation)
- We will see what limitations each approach has and their use cases

---
template: overview
count: false

.bg-main1.column[.vmiddle.content[
- Proportional difference
- .amber[Exact test]
- Approximation
- Paired sample
- Applying Yates' correction
]]

---
layout: false
class: bg-main3

# Fisher's exact test

.font2[
- Follows a hypergeometric distribution
- Concept recall: what is a geometric distribution?
- Extending previous concepts: what is a hypergeometric distribution?
]

???

- Geometric distribution: the first success arrives after `\(n\)` trials *with* replacement
- Hypergeometric distribution: `\(k\)` successes in `\(n\)` draws *without* replacement
- We can see the hypergeometric distribution as an extension of the binomial distribution
- In the binomial distribution, every trial has an *identical* success probability (sampling with replacement)

--

# How do we formulate the hypothesis?

--

.font2[
`\begin{align} H_0 &: \hat{p_1} = \hat{p_2} \\ H_a &: \hat{p_1} \neq \hat{p_2} \end{align}`
]

???

- `\(\hat{p_i}:\)` Proportion in group i

---
class: bg-main3

# How do we calculate the probability?

`\begin{align} P &= \frac{\binom{a + b}{a} \binom{c + d}{c}} {\binom{n}{a + c}} \tag{1} \\ \\ &= \frac{\binom{a + c}{a} \binom{b + d}{b}} {\binom{n}{a + b}} \tag{2} \\ \\ &= \frac{(a+b)!\ (c+d)!\ (a+c)!\ (b+d)!} {a!\ b!\ c!\ d!\ n!} \tag{3} \\ \\ \\ \\ n &= a + b + c + d \end{align}`

???

You may choose any of those equations

---
class: bg-main3

# In code, please?

```r
fisher.eq <- function(abcd) {
  # abcd is a numeric vector of the 4 cell counts: c(a, b, c, d)
  a <- abcd[1]; b <- abcd[2]; c <- abcd[3]; d <- abcd[4]
  # Probability of this table given fixed margins, following equation (1)
  choose(a+b, a) * choose(c+d, c) / choose(a+b+c+d, a+c)
}
```

--

# Let's solve our case!
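
The later slides pass an object called `survey` to `fisher.test()` and `chisq.test()` without showing how it is built. A minimal sketch, assuming it is a plain `\(2 \times 2\)` matrix holding the counts of our case (the `dimnames` labels are illustrative, not taken from the original code):

```r
# Assumed construction of `survey`; only the four counts come from the slides
survey <- matrix(
  c(30, 20,   # Suburban: Indomaret, Alfamart
    20, 30),  # Urban:    Indomaret, Alfamart
  nrow = 2, byrow = TRUE,
  dimnames = list(
    residence = c("Suburban", "Urban"),
    store     = c("Indomaret", "Alfamart")
  )
)
survey
```

Any object that `fisher.test()` accepts as a two-way table should work the same way.
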
.bg-white[
<br>
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Indomaret </th> <th style="text-align:right;"> Alfamart </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Suburban </td> <td style="text-align:right;"> 30 </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:left;"> Urban </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 30 </td> </tr> </tbody> </table>
<br>
]

???

- Fisher's test is an exact test
- Which means we need to take ALL possible tables with the same margins into account

---
class: bg-main3
count: false

# Fisher's equation solution

.bg-white[
<br>
<br>
]

---
class: bg-main3

# Calculate the p-value

## One-tailed test

```r
sum(tbl$probability)
```

```
## [1] 0.0357
```

--

## Two-tailed test

```r
# Doubling works here because the margins of our table are symmetric
sum(tbl$probability) * 2
```

```
## [1] 0.0713
```

---
layout: true
class: bg-main3

# Let `R` do the hard stuff for us

---
count: false

## One-tailed test

```r
fisher.test(survey, alternative="greater")
```

```
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  survey
## p-value = 0.04
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  1.06  Inf
## sample estimates:
## odds ratio 
##       2.23
```

---
count: false

## Two-tailed test

```r
fisher.test(survey, alternative="two.sided")
```

```
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  survey
## p-value = 0.07
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.94 5.41
## sample estimates:
## odds ratio 
##       2.23
```

---
layout: false
class: bg-main3

# A wild .pink[homework] has appeared!

### Perform Fisher's exact test on the following scenario:

- `\(a:\)` 40
- `\(b:\)` 15
- `\(c:\)` 15
- `\(d:\)` 20

### Task:

- Find the p-value for the .amber[one-tailed] test
- Find the p-value for the .amber[two-tailed] test

--

### Rules:

- Apply Fisher's equation to solve the problem
- You may use a calculator or write your own code
- Show me the table of your calculations
- .pink[Do not] use a pre-existing package! (`numpy` is allowed though)

---
class: bg-main3
count: false

# Can you get a similar solution when computing by hand?

```r
library(magrittr)  # provides the %>% pipe
c(40, 15, 15, 20) %>% matrix(nrow=2) %>% fisher.test()
```

```
## 
## 	Fisher's Exact Test for Count Data
## 
## data:  .
## p-value = 0.007
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.33 9.58
## sample estimates:
## odds ratio 
##        3.5
```

---
template: overview
count: false

.bg-main1.column[.vmiddle.content[
- Proportional difference
- Exact test
- .amber[Approximation]
- Paired sample
- Applying Yates' correction
]]

---
class: bg-main3

# Approximating Fisher's solution

.font2[
- There are several approaches we may follow
- Pearson's Chi-square and the G-test are popular ones
- We will only look into Chi-square
]

???

- Different Chi-square tests exist
- We have *goodness of fit* and *test of independence*
- Choose your method wisely

---
class: bg-main3
count: false

# Why an approximation?

.font2[
- As we have seen, an exact calculation is .amber[arduous]
- A larger sample size requires more .amber[computational power]
- And it is often applied only to a `\(2 \times 2\)` contingency table
- An approximation is more flexible
- It can even handle an `\(m \times n\)` contingency table
]

---
class: bg-main3

# Chi-square test of independence

.font2[
`$$\text{stat} \sim \chi^2(k)$$`

- The test statistic follows a Chi-square distribution
- Degrees of freedom `\(k\)` depend on the number of classes in `\(X\)` and `\(Y\)`: `\(k = (n_X - 1)(n_Y - 1)\)`
]

--

# Example

- Outcome 1: 2 classes, outcome 2: 2 classes `\(\to k=1\)`
- Outcome 1: 2 classes, outcome 2: 3 classes `\(\to k=2\)`
- Outcome 1: 3 classes, outcome 2: 3 classes `\(\to k=4\)`

---
class: bg-main3

# Calculating Chi-square

.font2[
`\begin{align} \chi^2 &= \displaystyle \sum_{i, j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \\ E_{ij} &= \frac{\left(\sum_{j} O_{ij}\right) \cdot \left(\sum_{i} O_{ij}\right)}{\sum_{i, j} O_{ij}} \end{align}`

`\(O:\)` Observed outcome

`\(E:\)` Expected outcome (row total times column total, divided by the grand total)

`\(i, j:\)` Row and column indices of the contingency table
]

---
layout: true
class: bg-main3 split-two

.bg-white.column[.vmiddle.content[
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left:
auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> Outcome 1 </th> <th style="text-align:left;"> Outcome 2 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Exposure 1 </td> <td style="text-align:left;"> a </td> <td style="text-align:left;"> b </td> </tr> <tr> <td style="text-align:left;"> Exposure 2 </td> <td style="text-align:left;"> c </td> <td style="text-align:left;"> d </td> </tr> </tbody> </table>
]]
.column[.vmiddle.center.content[

## Calculating the expected outcome

.font2[
`$$E_{ij} = \frac{\left(\sum_{j} O_{ij}\right) \cdot \left(\sum_{i} O_{ij}\right)}{\sum_{i, j} O_{ij}}$$`

{{content}}
]
]]

---
count: false

`$$E_{11} = \frac{(a + b) \cdot (a + c)}{a + b + c + d}$$`

---
count: false

`$$E_{12} = \frac{(a + b) \cdot (b + d)}{a + b + c + d}$$`

---
count: false

`$$E_{21} = \frac{(c + d) \cdot (a + c)}{a + b + c + d}$$`

---
count: false

`$$E_{22} = \frac{(c + d) \cdot (b + d)}{a + b + c + d}$$`

---
layout: true
class: bg-main3 split-two

.bg-white.column[.vmiddle.content[
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Indomaret </th> <th style="text-align:right;"> Alfamart </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Suburban </td> <td style="text-align:right;"> 30 </td> <td style="text-align:right;"> 20 </td> </tr> <tr> <td style="text-align:left;"> Urban </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 30 </td> </tr> </tbody> </table>
]]
.column[.vmiddle.center.content[
{{content}}
]]

---

## Calculating the expected outcome

`\begin{align} E_{ij} &= \frac{\left(\sum_{j} O_{ij}\right) \cdot \left(\sum_{i} O_{ij}\right)}{\sum_{i, j} O_{ij}} \\ \\ E_{11} &= \frac{50 \cdot 50}{100} = 25 \\ E_{12} &= 25 \\ E_{21} &= 25 \\ E_{22} &= 25 \end{align}`

---

# Calculating `\(\chi^2\)` statistics

`\begin{align} \chi^2 &= \displaystyle \sum_{i, j} \frac{(O_{ij} -
E_{ij})^2}{E_{ij}} \\ &= \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} + \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} \\ &= 4 \end{align}`

---
count: false

# Determining p-value

```r
1 - pchisq(4, df=1)
```

```
## [1] 0.0455
```

---
count: false

# Built-in function in `R`

```r
chisq.test(survey, correct=FALSE)$p.value
```

```
## [1] 0.0455
```

---
template: overview
count: false

.bg-main1.column[.vmiddle.content[
- Proportional difference
- Exact test
- Approximation
- .amber[Paired sample]
- Applying Yates' correction
]]

---
layout: false
class: bg-main3

# Paired samples in the contingency table

.font2[
- What is a paired sample?
- Why use a different measure?
- How do we solve it?
]

???

- When we do a longitudinal study
- We have the same sample, but measured at different times
- We need to take into account differences occurring over time
- Pearson's Chi-square cannot address this issue
- Solution: McNemar's Chi-square

--

# Example?

.font2[
Suppose we continue our market research, where we ask .amber[**exactly the same**] subjects *three months* later. We expected no changes in their preferences of chain-store outlets. It turned out that, regardless of their area of residence, .pink[25 people] who previously preferred to go to Indomaret now shop at Alfamart. Meanwhile, .pink[20 people] who used to visit Alfamart now prefer Indomaret.
]

---
class: bg-main3 split-two
count: false

.column[.vmiddle.center.content[

# Contingency table

.bg-white[
<br>
<table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Indomaret </th> <th style="text-align:right;"> Alfamart </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Indomaret </td> <td style="text-align:right;"> 25 </td> <td style="text-align:right;"> 25 </td> </tr> <tr> <td style="text-align:left;"> Alfamart </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 30 </td> </tr> </tbody> </table>
<br>
]
]]
.column[.vmiddle.center.content[

# Hypothesis

.font2[
`\begin{align} H_0 &: \hat{p_{t_0}} = \hat{p_{t_1}} \\ H_a &: \hat{p_{t_0}} \neq \hat{p_{t_1}} \end{align}`
]
]]

---
class: bg-main3

# McNemar's Chi-square

.font2[
`$$\chi^2 = \frac{(b-c)^2}{b+c}$$`

With the continuity correction that `mcnemar.test()` applies by default:

`$$\chi^2 = \frac{(|b-c| - 1)^2}{b+c}$$`
]

--

```r
mcnemar.test(survey2)
```

```
## 
## 	McNemar's Chi-squared test with continuity correction
## 
## data:  survey2
## McNemar's chi-squared = 0.4, df = 1, p-value = 0.6
```

---
template: overview

.bg-main1.column[.vmiddle.content[
- Proportional difference
- Exact test
- Approximation
- Paired sample
- .amber[Applying Yates' correction]
]]

---
class: bg-main3

# Yates' correction

.font2[
- Only applies to the approximation test
- Alleviates bias in a `\(2 \times 2\)` contingency table
- Especially useful when cell counts are low (< 10)
- With extremely low cell counts (< 5), use an exact test instead
]

--

.font2[
`$$\chi^2 = \displaystyle \sum_{i, j} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}$$`
]

---
class: bg-main3

# Lesson learnt

.font2[
- Large sample (> 10 in each cell) `\(\to\)` use approximation
- Low sample `\(\to\)` use an exact test
- `\(2 \times 2\)` contingency with approximation `\(\to\)` apply Yates' correction
- Low sample with `\(m \times n\)` contingency
table `\(\to\)` split it or use simulation
]

---
class: bg-main3 hide-slide-number center middle
count: false

.amber.font5[Questions?]
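
---
class: bg-main3
count: false

# Appendix: exact vs approximate test in `R`

The summary tables in the two opening examples look like tidied output of `binom.test()` and `prop.test()`. That is an assumption, since the slides do not show the calls. A minimal sketch that compares both p-values under `\(H_0: p = 0.5\)`, with `compare.tests` as a hypothetical helper name:

```r
# Exact (binomial) vs approximate (proportion) test of H0: p = p0
compare.tests <- function(x, n, p0 = 0.5) {
  c(
    exact  = as.numeric(binom.test(x, n, p = p0)$p.value),
    # prop.test() applies the continuity correction by default
    approx = as.numeric(prop.test(x, n, p = p0)$p.value)
  )
}

round(compare.tests(6, 10), 3)    # small sample
round(compare.tests(60, 100), 3)  # larger sample
```

The rounded values should line up with the `p.value` columns of the tables shown earlier.
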
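
---
class: bg-main3
count: false

# Appendix: Fisher's exact test from scratch

The Fisher slides sum table probabilities from an object `tbl` whose construction is not shown. The sketch below is one possible way to enumerate every table with the same margins and sum the relevant probabilities. The names `table.prob` and `fisher.by.hand` are hypothetical, and the two-sided rule (sum all tables no more likely than the observed one) is the convention `fisher.test()` follows, not necessarily the one used for `tbl`:

```r
# Probability of a single 2x2 table with fixed margins (hypergeometric)
table.prob <- function(a, b, c, d) {
  choose(a + b, a) * choose(c + d, c) / choose(a + b + c + d, a + c)
}

fisher.by.hand <- function(a, b, c, d) {
  r1 <- a + b; r2 <- c + d; c1 <- a + c
  # Every value the top-left cell can take while the margins stay fixed
  a.all <- max(0, c1 - r2):min(r1, c1)
  p.all <- table.prob(a.all, r1 - a.all, c1 - a.all, r2 - c1 + a.all)
  p.obs <- table.prob(a, b, c, d)
  list(
    # One-tailed: tables with a top-left count at least as large as observed
    greater   = sum(p.all[a.all >= a]),
    # Two-tailed: tables at most as probable as the observed one
    two.sided = sum(p.all[p.all <= p.obs * (1 + 1e-7)])
  )
}

fisher.by.hand(30, 20, 20, 30)
```

The small tolerance `1e-7` guards against floating-point ties between equally likely tables.
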
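
---
class: bg-main3
count: false

# Appendix: Chi-square from scratch

The hand calculation of the expected counts and the `\(\chi^2\)` statistic can be sketched for any `\(m \times n\)` table as follows. `chisq.by.hand` is a hypothetical helper, and `survey` is rebuilt here as a plain matrix, which is an assumption about the original object:

```r
chisq.by.hand <- function(observed) {
  n <- sum(observed)
  # Expected count: row total * column total / grand total
  expected <- outer(rowSums(observed), colSums(observed)) / n
  stat <- sum((observed - expected)^2 / expected)
  df <- (nrow(observed) - 1) * (ncol(observed) - 1)
  list(
    expected  = expected,
    statistic = stat,
    df        = df,
    p.value   = pchisq(stat, df = df, lower.tail = FALSE)
  )
}

survey <- matrix(c(30, 20, 20, 30), nrow = 2, byrow = TRUE)
chisq.by.hand(survey)
```

No Yates' correction is applied here, so the result is comparable to `chisq.test(survey, correct=FALSE)`.
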
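
---
class: bg-main3
count: false

# Appendix: McNemar's Chi-square from scratch

McNemar's statistic only uses the two discordant cells, `\(b\)` and `\(c\)`. A minimal sketch with and without the continuity correction, where `mcnemar.by.hand` is a hypothetical helper and the counts are those of the follow-up survey:

```r
mcnemar.by.hand <- function(b, c, correct = TRUE) {
  # The continuity correction subtracts 1 from |b - c| before squaring
  adj <- if (correct) 1 else 0
  stat <- (abs(b - c) - adj)^2 / (b + c)
  list(
    statistic = stat,
    p.value   = pchisq(stat, df = 1, lower.tail = FALSE)
  )
}

# 25 switched from Indomaret to Alfamart, 20 switched the other way
mcnemar.by.hand(25, 20, correct = FALSE)
mcnemar.by.hand(25, 20, correct = TRUE)
```

The corrected version should agree with `mcnemar.test()` on the same table, since that function applies the correction by default for `\(2 \times 2\)` tables.
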