Parametric: Mean in Two Groups
Aly Lamuri
Indonesia Medical Education and Research Institute
Overview
$$\frac{x - \bar{x}}{s} \sim N(0, 1) \tag{1}$$

$$\bar{X} \xrightarrow{d} N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \tag{2}$$

Ideal: by knowing the parameters $\mu$ and $\sigma$, we can directly compare our sample mean to its corresponding population.

With known $\mu$ and $\sigma$, we can make a direct comparison.

But... what if we don't know $\mu$?
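As a sketch (not part of the original slides), equation (2) can be demonstrated by simulation: the mean and standard deviation of repeated sample means approach $\mu$ and $\sigma/\sqrt{n}$. The sample size, $\mu$, and $\sigma$ below are made-up values.

```r
# Simulation sketch (made-up parameters): the sampling distribution of
# the mean approaches N(mu, sigma / sqrt(n)), as stated in equation (2)
set.seed(1)
n     <- 30
mu    <- 100
sigma <- 15

# 5000 sample means, each from an independent sample of size n
xbar <- replicate(5000, mean(rnorm(n, mu, sigma)))

mean(xbar)       # close to mu
sd(xbar)         # close to sigma / sqrt(n)
sigma / sqrt(n)
```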
Solution: use its statistic $\bar{x}$ as an estimate
$$SE = \frac{\sigma}{\sqrt{n}} \quad \text{(Standard Error)}$$

$$z = \frac{\bar{x} - \mu_0}{SE} = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \quad \text{(One-sample test)}$$

How do we get the p-value?

(Hint: the Z-statistic follows the Z-distribution)

To get the p-value, we evaluate the acquired z statistic as a quantile of the Z-distribution.
In a population of third-year electrical engineering students, we know the average final score of a particular course is 70. To measure students' comprehension, UKRIDA has established a standardized examination with a standard deviation of 10. We are interested in whether students registered in this year's course have a different average: 18 students averaged 75 on the final exam.

$$H_0: \bar{x} = \mu_0 \qquad H_a: \bar{x} \neq \mu_0$$

$$SE = \frac{10}{\sqrt{18}} = 2.36 \qquad z = \frac{75 - 70}{2.36} = 2.12$$


$$P(Z \leqslant 2.12 \mid \mu, \sigma): Z \sim N(0, 1)$$
2 * {1 - pnorm(2.12, 0, 1)}
## [1] 0.034

What if we do not know σ?

We are unable to use the Z-distribution

Solution: use Student's T-distribution
Let $X \sim t_\nu$ (Notation)

$$P(X = x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu \pi}\, \Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}, \quad \nu = n - 1 \quad \text{(PDF)}$$

Let $T \sim t_\nu$, with $Z \sim N(0, 1)$ and $V \sim \chi^2_\nu$:

$$T = \frac{Z}{\sqrt{V / \nu}} \quad \text{(Relationship)}$$
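As a quick check (not part of the original slides), the PDF above can be coded by hand and compared with R's built-in `dt()`; the values of $\nu$ and $x$ below are arbitrary.

```r
# Hand-coded t density from the PDF above, checked against dt()
nu <- 19   # arbitrary degrees of freedom
x  <- 1.5  # arbitrary quantile

manual <- gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2)) *
  (1 + x^2 / nu)^(-(nu + 1) / 2)

all.equal(manual, dt(x, df = nu))
## [1] TRUE
```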

Overview
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
set.seed(1)
x <- rnorm(20, mean=120, sd=20)
summary(x)
##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
##  75.7   112.3  127.2 123.8   135.2 151.9
sd(x)
## [1] 18.3

$$H_0: \bar{x} = 120 \qquad H_a: \bar{x} \neq 120$$

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
# `%T>%` is the tee pipe; requires library(magrittr)
t <- {{mean(x) - 120} / {sd(x) / sqrt(20)}} %T>% print()
## [1] 0.933

1 - pt(t, df=19)
## [1] 0.181
2 * {1 - pt(t, df=19)}
## [1] 0.363

R?

t.test(x, mu=120)
##
##  One Sample t-test
##
## data:  x
## t = 0.9, df = 19, p-value = 0.4
## alternative hypothesis: true mean is not equal to 120
## 95 percent confidence interval:
##  115 132
## sample estimates:
## mean of x
##       124

Overview
$$H_0: \bar{x}_1 - \bar{x}_2 = d \qquad H_a: \bar{x}_1 - \bar{x}_2 \neq d$$

$$d = \mu_1 - \mu_2 = 0$$

In special cases, we may find $d \neq 0$
$$t = \frac{\bar{x}_1 - \bar{x}_2 - d}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \quad \text{(Statistic)}$$

$$s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{\nu}} \quad \text{(Pooled variance)}$$

$$\nu = n_1 + n_2 - 2 \quad \text{(Degrees of freedom)}$$
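The pooled statistic can be computed by hand and checked against R's built-in test; this is a sketch with made-up data, not part of the original slides.

```r
# Sketch with made-up data: pooled two-sample statistic computed by hand,
# then checked against t.test(var.equal = TRUE)
set.seed(2)
x1 <- rnorm(12, 50, 5)
x2 <- rnorm(15, 46, 5)
n1 <- length(x1); n2 <- length(x2)

nu <- n1 + n2 - 2                                         # degrees of freedom
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / nu)  # pooled SD
t  <- (mean(x1) - mean(x2)) / (sp * sqrt(1 / n1 + 1 / n2))

# Both values should agree
t
unname(t.test(x1, x2, var.equal = TRUE)$statistic)
```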
Often, our data violate the equal variance assumption
Solution: Welch's T-Test
$$t = \frac{\bar{x}_1 - \bar{x}_2 - d}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \quad \text{(Statistic)}$$

$$\nu = \frac{(n_1 - 1)(n_2 - 1)}{(n_2 - 1) C^2 + (1 - C)^2 (n_1 - 1)} \quad \text{(Degrees of freedom)}$$

$$C = \frac{s_1^2 / n_1}{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
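Welch's correction can likewise be verified by hand; a sketch with made-up data (not from the slides), checked against `t.test()`'s reported degrees of freedom.

```r
# Sketch with made-up data: Welch's statistic and the C-form of its
# degrees of freedom, checked against t.test(var.equal = FALSE)
set.seed(3)
x1 <- rnorm(20, 10, 2)
x2 <- rnorm(25, 12, 6)
n1 <- length(x1); n2 <- length(x2)
v1 <- var(x1) / n1; v2 <- var(x2) / n2

t  <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)
C  <- v1 / (v1 + v2)
nu <- (n1 - 1) * (n2 - 1) / ((n2 - 1) * C^2 + (1 - C)^2 * (n1 - 1))

# Degrees of freedom should match R's Welch correction
nu
unname(t.test(x1, x2, var.equal = FALSE)$parameter)
```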
For the record:
Suppose we are collecting data on body height. Our population of interest is students registered at UKRIDA, where we categorize sex as female and male. We acquire normally distributed data from both sexes. We have a sample of 25 females and 30 males, and would like to conduct a hypothesis test on the mean difference.
set.seed(5)
tbl <- data.frame(
  "height" = c(rnorm(30, 170, 8), rnorm(25, 155, 16)),
  "sex" = c(rep("male", 30), rep("female", 25))
)
tapply(tbl$height, tbl$sex, summary)
## $female
##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
##   123     146    154   158     173   190
##
## $male
##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
##   152     165    168   170     177   184

tapply(tbl$height, tbl$sex, sd)
## female   male
##  17.98   7.93
Does it follow the normal distribution?
tapply(tbl$height, tbl$sex, shapiro.test)
## $female
##
##  Shapiro-Wilk normality test
##
## data:  X[[i]]
## W = 1, p-value = 0.7
##
##
## $male
##
##  Shapiro-Wilk normality test
##
## data:  X[[i]]
## W = 1, p-value = 0.2

Yes, each group follows a normal distribution
car::leveneTest(tbl$height ~ tbl$sex)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)
## group  1    14.3 0.00039 ***
##       53
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Levene's test suggests heterogeneous variance (hint: significant p-value)
t.test(height ~ sex, data=tbl, var.equal=FALSE)
##
##  Welch Two Sample t-test
##
## data:  height by sex
## t = -3, df = 32, p-value = 0.004
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.87  -4.07
## sample estimates:
## mean in group female   mean in group male
##                  158                  170

Perform Welch's T-Test since sampled variances are not equal
t.test(height ~ sex, data=tbl, var.equal=TRUE)
##
##  Two Sample t-test
##
## data:  height by sex
## t = -3, df = 53, p-value = 0.002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.27  -4.67
## sample estimates:
## mean in group female   mean in group male
##                  158                  170

Student's T-Test, to demonstrate type-I error inflation (hint: look at the p-value)
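Type-I error inflation can also be shown directly by simulation; a sketch with made-up parameters (not from the slides): when the true means are equal, every rejection is a false positive, so the rejection rate estimates the type-I error.

```r
# Simulation sketch (made-up parameters): equal means, unequal variances.
# Student's T-Test rejects H0 too often; Welch's stays near the nominal 5%
set.seed(1)
pvals <- replicate(2000, {
  x1 <- rnorm(30, 100, 5)   # large group, small variance
  x2 <- rnorm(10, 100, 20)  # small group, large variance
  c(
    student = t.test(x1, x2, var.equal = TRUE)$p.value,
    welch   = t.test(x1, x2, var.equal = FALSE)$p.value
  )
})

rowMeans(pvals < 0.05)  # empirical type-I error rates at alpha = 0.05
```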

Overview
$$\mu_d = \mu_1 - \mu_2$$

$$H_0: \mu_d = 0 \qquad H_a: \mu_d \neq 0$$
In the current investigation, we are looking for the effect of a certain antihypertensive drug. First, we measure the baseline blood pressure, then prescribe the drug to all subjects. After one month, we re-measure the blood pressure. Each subject has a unique identifier, so we can compute mean differences within paired samples. Suppose we have the following scenario in 30 sampled subjects:
Set our hypotheses:
$$H_0: \bar{x}_d = 0 \qquad H_a: \bar{x}_d \neq 0$$
set.seed(1)
tbl <- data.frame(
  "bp" = c(rnorm(30, 140, 12), rnorm(30, 133, 17)),
  "time" = c(rep("Before", 30), rep("After", 30)) %>%
    factor(levels=c("Before", "After"))
)

# Measure the difference within pairs
md <- with(tbl, bp[time=="Before"] - bp[time=="After"])

# Calculate t-statistics; NB: `%T>%` binds tighter than `/`, so the
# printed value is the standard error, not t itself
t <- {mean(md)} / {sd(md) / sqrt(30)} %T>% print()
## [1] 3.12

# Obtain p-value for a two-sided test
2 * {1 - pt(t, df=29)}
## [1] 0.076

# Comparison to built-in one-sample T-Test
t.test(md, mu=0)
##
##  One Sample t-test
##
## data:  md
## t = 2, df = 29, p-value = 0.08
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.64 12.10
## sample estimates:
## mean of x
##      5.73

# Comparison to built-in paired T-Test
t.test(bp ~ time, data=tbl, paired=TRUE)
##
##  Paired t-test
##
## data:  bp by time
## t = 2, df = 29, p-value = 0.08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.64 12.10
## sample estimates:
## mean of the differences
##                    5.73

Overview
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p} \quad \text{(Cohen's D)} \qquad s_p = \sqrt{\frac{s_1^2 + s_2^2}{2}} \quad \text{(Pooled SD)}$$
# Calculate pooled standard deviation (`add` is a magrittr alias for `+`)
sp <- sqrt({with(tbl, tapply(bp, time, var, simplify=FALSE)) %>% {do.call(add, .)}} / 2) %T>% print()
## [1] 12.4

# Measure Cohen's distance (`subtract` is a magrittr alias for `-`)
{with(tbl, tapply(bp, time, mean, simplify=FALSE)) %>% {do.call(subtract, .)}} / sp
## [1] 0.464

# Calculate effect size using the `psych` package
d <- psych::cohen.d(tbl ~ time) %T>% print()
## Call: psych::cohen.d(x = tbl ~ time)
## Cohen d statistic of difference between two means
##    lower effect upper
## bp -0.99  -0.47  0.05
##
## Multivariate (Mahalanobis) distance between groups
## [1] 0.47
## r equivalent of difference between two means
##    bp
## -0.23

# Power analysis using previous information
pwr::pwr.t.test(n=30, d=d$cohen.d[[2]], sig.level=0.05, type="paired")
##
##      Paired t test power calculation
##
##              n = 30
##              d = 0.472
##      sig.level = 0.05
##          power = 0.704
##    alternative = two.sided
##
## NOTE: n is number of *pairs*

Query?