Parametric: Mean in Two Groups
When we obtain normally-distributed data, the parameters $\mu$ and $\sigma$ from our PDF can completely explain the behaviour seen in our sample. With $\mu$ representing the central tendency and $\sigma$ the spread, we can directly compare similarly distributed samples. Often, we need to determine how much our average value differs from other observations. In doing so, we are facing a mean difference problem. This lecture will help us test mean differences in one-sample and two-sample problems.
Mean Difference
We may have a vivid recollection from previous lectures of data distributions and the Central Limit Theorem (CLT). For any data following a normal distribution, centering and scaling according to its $\mu$ and $\sigma$ results in a standard normal distribution, i.e. a Z-distribution.
For any given distribution, the mean of such a sample will converge, as a random variable, to a normal distribution with parameters $\mu$ and $\frac{\sigma}{\sqrt{n}}$.
With known $\mu$ and $\sigma$, we can make a direct comparison between our data and the population. However, what if we do not know the parameter $\mu$? We can make a close estimate using its statistic, $\bar{x}$. A mean difference is simply the result of subtracting the hypothesized parameter from the sample mean, which corresponds to $\bar{x} - \mu_0$. However, our sample is bound to have error, either systematic or unsystematic (random). Adjusting for this requires us to divide our measure by the standard error. The obtained quotient is our statistic of interest, which follows a Z-distribution:

$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
Having our statistic as an element of the Z-distribution, we can compute the p-value by looking at the probability of our statistical value. As always, we first need to set the significance level $\alpha$ so we can see where our statistic is located in relation to the critical value. We regard this approach as a Z-test, where the p-value represents the probability of observing a statistic at least as extreme as ours under $H_0$.
We shall consider the following scenario as an example:
In a population of third-year electrical engineering students, we know the average final score of a particular course is 70. In measuring students’ comprehension, UKRIDA has established a standardized examination with a standard deviation of 10. We are interested to see whether students registered in this year’s course have a different average, where 18 students scored 75 on average on the final exam.
Then, we need to formulate our hypotheses:

$$H_0: \mu = 70 \quad \text{vs.} \quad H_1: \mu \neq 70$$
Followed by computing the statistic:

$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{75 - 70}{10 / \sqrt{18}} \approx 2.12$$
Where is $Z \approx 2.12$ located in the Z-distribution?
Assigning our significance level $\alpha$ to both tails results in a rejection region in each tail.
Since our $H_1$ assumes non-equality, we can compute the p-value according to two-tailed test procedures. First we need to find the cumulative probability $P(Z \leq z)$.
Then we subtract it from 1 and multiply the difference by 2 to obtain the two-tailed p-value.
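A minimal sketch of this computation in R, using the values from the scenario above:

```r
# Z statistic: (x_bar - mu_0) / (sigma / sqrt(n))
z <- (75 - 70) / (10 / sqrt(18))

# two-tailed p-value: subtract the cumulative probability from 1 and double it
2 * (1 - pnorm(z))
```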
| [1] 0.034
So far, we understand that the Z-test requires the sample to follow a normal distribution. Before conducting any formal test, it is imperative to ascertain the sampled distribution, e.g. using a goodness-of-fit or normality test. We do not need to know the parameter $\mu$, because we can hypothesize its value. However, we do need the parameter $\sigma$ to correctly compute the standard error. This becomes quite problematic when we do not know the value of $\sigma$, which we often don’t! In such a case, we need to consider using a T-distribution instead.
Student’s T-Distribution
Student’s T-distribution only depends on one parameter, the degrees of freedom $\nu$. Mathematically, the T-distribution has the following notation: $T \sim t(\nu)$. The T-distribution is pivotal for computing the statistic in a mean difference of normally-distributed data. The degrees of freedom in the T-distribution is simply $\nu = n - 1$, where $n$ represents the total number of samples.
Aside from its relationship with the Z-distribution, the T-distribution also relates to the $\chi^2$ distribution, where they share the same degrees of freedom.
One Sample T-Test
The one-sample T-test is analogous to the Z-test; we use it when we cannot ascertain $\sigma$. In place of $\sigma$, the T-test uses $s$ as an estimate of the population parameter. By adapting the Z-statistic equation, we can compute the T statistic as follows:

$$T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
As an example, we shall generate an array of random numbers following a normal distribution with $\mu = 120$.
First, we do some basic exploratory analysis by finding the central tendency and spread.
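A sketch of this step in R; the random seed and the value of $\sigma$ used for generation are assumptions, so the exact numbers will differ slightly from the output below:

```r
set.seed(1)                          # the original seed is unknown
x <- rnorm(20, mean = 120, sd = 20)  # n = 20; sd = 20 is an assumption

summary(x)  # central tendency and quartiles
sd(x)       # sample standard deviation
```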
| Min. 1st Qu. Median Mean 3rd Qu. Max.
| 75.7 112.3 127.2 123.8 135.2 151.9
| [1] 18.3
We let $\mu_0 = 120$, yet our $\bar{x}$ is 123.81 with an $s$ of 18.3 and a $\nu$ of 19. Does our statistic differ from the parameter $\mu_0$? We can further formulate our question into hypotheses:

$$H_0: \mu = 120 \quad \text{vs.} \quad H_1: \mu \neq 120$$
Then we can determine the statistic:
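A sketch of the manual computation, assuming the vector `x` from above:

```r
# T statistic: mean difference divided by the estimated standard error
mu0 <- 120
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
t_stat
```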
| [1] 0.933
Then we shall locate the statistic within its $t(19)$ distribution.
Then we can compute the p-value for a one-tailed test:
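Using the upper tail of the $t(19)$ distribution (a sketch with the `t_stat` object from above):

```r
# one-tailed p-value: upper-tail probability at t_stat with n - 1 degrees of freedom
pt(t_stat, df = length(x) - 1, lower.tail = FALSE)
```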
| [1] 0.181
Also the p-value for a two-tailed test:
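Doubling that upper-tail probability (same assumptions as above):

```r
# two-tailed p-value: double the one-tailed probability
2 * pt(abs(t_stat), df = length(x) - 1, lower.tail = FALSE)
```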
| [1] 0.363
How does our calculation compare to the built-in function in R?
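A sketch of the equivalent call, assuming the same vector `x`:

```r
# built-in one-sample T-test against the hypothesized mean of 120
t.test(x, mu = 120)
```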
|
| One Sample t-test
|
| data: x
| t = 0.9, df = 19, p-value = 0.4
| alternative hypothesis: true mean is not equal to 120
| 95 percent confidence interval:
| 115 132
| sample estimates:
| mean of x
| 124
At this point, we may have realised that R only prints a rounded value for the
acquired computation. If we are interested in the actual value, we may save our
test result as an object, then directly call the specific element. In the
following demonstration, we shall save the T-test result and obtain its p-value.
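A sketch of that demonstration (the object name `res` is arbitrary):

```r
# save the test result and extract the unrounded p-value from the htest object
res <- t.test(x, mu = 120)
res$p.value
```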
| [1] 0.363
Comparing our computation with the result acquired from R's built-in function,
we fail to reject our $H_0$, so we conclude that our sample mean does not differ significantly from $\mu_0 = 120$.
Unpaired T-Test
When we have two samples, we can compare the central tendency of both samples by computing the mean difference. If both datasets follow a normal distribution, we conduct a two-sample T-test. As in previous examples, the unpaired T-test (or rather, the T-test in general) assumes normality. Moreover, there are further assumptions when conducting a T-test, resulting in two distinct types of the test, i.e. Student's and Welch's approaches. The T-test is arguably robust to non-normally distributed data to a certain degree, although skewness and outliers strongly influence its robustness. The problem with robustness is how to properly evaluate whether the T-test still provides a correct inference when the data are non-normal. A few simulations have demonstrated that the T-test is robust against distributions within a certain parameter range. However, when we have real-world data, we often do not know the underlying parameters. In such cases, it is safer to follow the normality assumption to avoid type-I error inflation. To test the mean difference between two samples, we formulate the following hypotheses:

$$H_0: \mu_1 - \mu_2 = \mu_0 \quad \text{vs.} \quad H_1: \mu_1 - \mu_2 \neq \mu_0$$
As outlined above, our hypotheses depend on the value of $\mu_0$, which defaults to $\mu_0 = 0$. On some rare occasions, we may apply a different value of $\mu_0$, but for now we will stick with $\mu_0 = 0$.
Student’s T-Test
Student’s approach to the T-test is a test with a pooled variance. We may conduct this test when we know that the variances in both samples are comparably similar. We denote this similarity as homogeneity of variance, which we can assess using Levene’s test.
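For reference, the pooled-variance statistic takes the usual form:

$$s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}}$$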
However, if our data fail to fulfil the homogeneity of variance assumption, we should use Welch’s T-test instead.
Welch’s T-Test
Welch’s approach still assumes normality, but gives more leniency to the equality of variance. As a result, Welch modified the equation for computing the statistic, where we may find:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
Compared to Student’s T-test, Welch’s provides a different measure of $\nu$ (the adjusted degrees of freedom) to account for differences in sample variances. As an example, we may consider the following situation:
Suppose we are collecting data on body height. Our population of interest is students registered at UKRIDA, where we categorize sex as female and male. We acquire normally distributed data from both sexes, with a different mean and spread in each group.
We have a sample of 25 females and 30 males, and would like to conduct a hypothesis test on the mean difference.
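A sketch of the exploratory step; the data frame `dat` and its columns `height` and `sex` are assumed names based on the test output further below:

```r
# per-group summary statistics and standard deviations
lapply(split(dat$height, dat$sex), summary)
sapply(split(dat$height, dat$sex), sd)
```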
| $female
| Min. 1st Qu. Median Mean 3rd Qu. Max.
| 123 146 154 158 173 190
|
| $male
| Min. 1st Qu. Median Mean 3rd Qu. Max.
| 152 165 168 170 177 184
| female male
| 17.98 7.93
Do both groups in our sample follow a normal distribution?
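The `data: X[[i]]` label in the output below suggests the test was applied over a list; a sketch under the same assumed names:

```r
# Shapiro-Wilk normality test for each group
lapply(split(dat$height, dat$sex), shapiro.test)
```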
| $female
|
| Shapiro-Wilk normality test
|
| data: X[[i]]
| W = 1, p-value = 0.7
|
|
| $male
|
| Shapiro-Wilk normality test
|
| data: X[[i]]
| W = 1, p-value = 0.2
Considering both p-values in the Shapiro-Wilk test being above the significance level, we can assume normality in both groups. We then test for homogeneity of variance using Levene’s test:
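Levene's test is available in the `car` package; a sketch with the assumed data frame `dat`:

```r
# Levene's test for homogeneity of variance (centred on the median by default)
car::leveneTest(height ~ sex, data = dat)
```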
| Levene's Test for Homogeneity of Variance (center = median)
| Df F value Pr(>F)
| group 1 14.3 0.00039 ***
| 53
| ---
| Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpreting the low p-value, we can conclude that the variances in both groups are not equal to one another. In this case, we will follow Welch’s method.
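`t.test()` applies Welch's method by default (`var.equal = FALSE`); a sketch with the assumed data frame `dat`:

```r
# Welch two-sample T-test on height by sex
t.test(height ~ sex, data = dat)
```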
|
| Welch Two Sample t-test
|
| data: height by sex
| t = -3, df = 32, p-value = 0.004
| alternative hypothesis: true difference in means between group female and group male is not equal to 0
| 95 percent confidence interval:
| -19.87 -4.07
| sample estimates:
| mean in group female mean in group male
| 158 170
Just for curiosity’s sake, we may want to try Student’s method as well and see how it differs from Welch’s:
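Setting `var.equal = TRUE` requests the pooled-variance (Student's) version:

```r
# Student's two-sample T-test with pooled variance
t.test(height ~ sex, data = dat, var.equal = TRUE)
```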
|
| Two Sample t-test
|
| data: height by sex
| t = -3, df = 53, p-value = 0.002
| alternative hypothesis: true difference in means between group female and group male is not equal to 0
| 95 percent confidence interval:
| -19.27 -4.67
| sample estimates:
| mean in group female mean in group male
| 158 170
The Student’s T-test reported a lower p-value than Welch’s T-test when we have unequal variances. A low p-value is not a bad sign per se, but we need to be wary when we have violated required assumptions; a low p-value may then indicate an inflation of statistical error. After conducting our test, we can summarize our findings by visualizing them.
Visualizing our results is important when conducting statistical inference, since it gives the reader a clearer representation of what we observed in our data. Both figures give similar information, yet convey it in different fashions.
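As a sketch of one such figure, assuming the data frame `dat` from above:

```r
# a simple boxplot of height by sex as one way to visualize the group comparison
boxplot(height ~ sex, data = dat)
```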
Paired T-Test
The equation in the unpaired T-test implicitly implies independence of each data point, so we cannot correctly make inferences on paired data. We may consider using a paired T-test in the following situations:
- Difference between multiple measurements
- Probability events where each instance influences another
In measuring mean differences in paired data, we first need to reduce the complexity. Suppose $x_{i,1}$ and $x_{i,2}$ represent measurements at $t_1$ and $t_2$, where both measures come from the same subject (a within-subject comparison). We can calculate the difference between both measurements:

$$d_i = x_{i,2} - x_{i,1}$$
Then we only need to take into account the one-sample difference, where we hypothesize:

$$H_0: \mu_d = 0 \quad \text{vs.} \quad H_1: \mu_d \neq 0$$
Does it seem familiar? Because it is! By reducing the complexity, we can infer differences in paired data by applying a one-sample T-test to $d$. The following example will help us visualize the concept:
In the current investigation, we are looking at the effect of a certain antihypertensive drug. First we measure the baseline blood pressure, then prescribe the drug to all subjects. Then, we re-measure the blood pressure after one month. Each subject has a unique identifier, so we can specify mean differences within paired samples. Suppose we have the following scenario in 30 sampled subjects:
We then set our hypotheses, $H_0: \mu_d = 0$ against $H_1: \mu_d \neq 0$, and compute the standard error of the differences along with the corresponding p-value:
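A sketch of that computation, assuming a vector `md` of within-pair differences (as named in the `t.test` output further down):

```r
# standard error of the mean paired difference
se_md <- sd(md) / sqrt(length(md))
se_md

# two-tailed p-value from the T statistic with n - 1 degrees of freedom
2 * pt(abs(mean(md) / se_md), df = length(md) - 1, lower.tail = FALSE)
```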
| [1] 3.12
| [1] 0.076
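Both outputs below could come from calls such as the following sketch; the vector names for the two measurement occasions are hypothetical:

```r
# one-sample T-test on the paired differences
t.test(md, mu = 0)

# equivalent paired T-test on the two measurement vectors
# (`bp_before` and `bp_after` are hypothetical names; the printed output used
#  a formula interface, shown as `bp by time`)
t.test(bp_before, bp_after, paired = TRUE)
```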
|
| One Sample t-test
|
| data: md
| t = 2, df = 29, p-value = 0.08
| alternative hypothesis: true mean is not equal to 0
| 95 percent confidence interval:
| -0.64 12.10
| sample estimates:
| mean of x
| 5.73
|
| Paired t-test
|
| data: bp by time
| t = 2, df = 29, p-value = 0.08
| alternative hypothesis: true mean difference is not equal to 0
| 95 percent confidence interval:
| -0.64 12.10
| sample estimates:
| mean difference
| 5.73
Choosing an Appropriate Test
All tests explained in this post assume data normality, both for the T-test and the Z-test. As a general rule of thumb, we may use the following conventions:
- One-sample test:
  - Known $\sigma$: use a Z-Test
  - Unknown $\sigma$: use a one-sample T-Test
- Two-sample test: do Levene’s test first
  - Equal variance: Student’s method (pooled variance)
  - Unequal variance: Welch’s method
- Paired T-Test: basically a one-sample T-Test on the sampled differences
Effect Size
The previous lecture on the sample size equation served as a brief introduction to statistical power, which presented us with the new concept of effect size. There are two further concepts related to effect size and statistical power, viz. the sample size $n$ and the significance level $\alpha$. The effect size calculation varies with the type of data we consider and how we conduct our formal test. In this section, we will focus on measuring the effect size of a mean difference between two groups using Cohen’s distance $d$.
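As a reminder, Cohen's $d$ for a mean difference between two groups is commonly defined as the mean difference divided by the pooled standard deviation:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$$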
As an example, we will re-use the previous scenario:
| [1] 3.12
| [1] 0.076
We will calculate $d$ by issuing the following command in R. Beware though,
different methods of computing $d$ also exist, as we can be more flexible in
expressing our code. I prefer this method because it clearly explains what we
do, at the expense of needing to understand the apply family of functions in R,
which is a good thing to know if you are going to learn R (and I hope you will!)
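The original command is not reproduced here; the following sketch, assuming the data frame `tbl` with columns `bp` and `time`, illustrates one way to obtain a pooled standard deviation and Cohen's $d$ with the apply family:

```r
# group means and variances of blood pressure by time point
m <- tapply(tbl$bp, tbl$time, mean)
v <- tapply(tbl$bp, tbl$time, var)

# pooled standard deviation (equal group sizes), then Cohen's d
s_pool <- sqrt(mean(v))
s_pool
unname(abs(m[1] - m[2])) / s_pool
```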
| [1] 12.4
| [1] 0.464
As a comparison, we can also compute $d$ using the psych package.
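The `Call:` line in the output indicates the formula interface of `psych::cohen.d()` was used with the data frame `tbl`:

```r
# Cohen's d with confidence limits via the psych package
psych::cohen.d(tbl ~ time)
```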
| Call: psych::cohen.d(x = tbl ~ time)
| Cohen d statistic of difference between two means
| lower effect upper
| bp -0.98 -0.47 0.04
|
| Multivariate (Mahalanobis) distance between groups
| [1] 0.47
| r equivalent of difference between two means
| bp
| -0.23
Finally, we can apply $d$ to compute the statistical power:
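The output below matches the `pwr` package's paired calculation; a sketch using the effect size obtained above:

```r
# statistical power for a paired T-test with 30 pairs at alpha = 0.05
pwr::pwr.t.test(n = 30, d = 0.472, sig.level = 0.05, type = "paired")
```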
|
| Paired t test power calculation
|
| n = 30
| d = 0.472
| sig.level = 0.05
| power = 0.704
| alternative = two.sided
|
| NOTE: n is number of *pairs*