.column.bg-main4[.vmiddle.content[
# Data: Type and Distribution

.amber[Aly Lamuri]  
Indonesia Medical Education and Research Institute
]]

---

name: overview
layout: true
class: split-30 hide-slide-number bg-main2
count: false

---

.column.bg-main4[.vmiddle.content[
- .amber[Data type]
- Probability Density Function
- Goodness of fit test
- Test of normality
- Central Limit Theorem
]]

---

.column.font2[.vmiddle.center.content[
{{content}}
]]

---

???
- Numerous conventions in describing data
- Understanding the nature behind categorical and numeric is more important
- Examples of established conventions:
  - Nominal, ordinal, interval, ratio
  - Categorical, discrete, continuous

---

<img src="https://www.incimages.com/uploaded_files/image/970x450/male-female-sign-1940x900_35330.jpg" width="100%">
Nominal

???
Other examples:
- Types of car
- Brands
- Netflix shows

---

<img src="https://static.vecteezy.com/system/resources/previews/000/680/216/original/spicy-level-of-red-hot-pepper.jpg" width="100%">
Ordinal

???
Other examples:
- Disease severity
- Qualitative measure: bad `$\to$` good

---

<img src="https://nypost.com/wp-content/uploads/sites/2/2018/03/180315-water-bottles-feature-image.jpg?quality=90&strip=all&w=1200" width="100%">
Discrete (clue: countable)

---

<img src="https://livelaughloveandlose.files.wordpress.com/2016/05/weighing-scales.jpg" width="90%">
Continuous (clue: measurable)

???
Continuous:
- Interval
- Ratio

---

.row.bg-main2[.vmiddle.content[
# Continuous Data
]]

.amber[Interval]

- Has a fixed distance
- Arithmetic: addition and subtraction
- Examples:
  - Likert scale
  - Temperature in Celsius or Fahrenheit

.amber[Ratio]

- Has an absolute zero
- Infinitesimal measure
- All arithmetic rules are applicable
- Examples:
  - Temperature in Kelvin
  - Weight

---

# How about Likert-type item?

.font2[
- Usually uses a discrete scale of 4, 5, 7 or 10 points
- Some regard a Likert-type question as discrete counts
- Others treat it as a continuous interval
- Context-dependent
]

---

# Checkpoint! What type of .amber[data] do we have?

--
1. We were conducting a survey in .amber[three universities].

--
1. From each university, we sampled the .amber[first, second, penultimate and final] year students in a four-year programme.

--
1. We nicely asked them to indicate their .amber[level of burnout] using a Likert-type self-report inventory.

--
1. We also kindly measured their .amber[blood cortisol] level.

---

.column.bg-main4[.vmiddle.content[
- Data type
- .amber[Probability Density Function]
- Goodness of fit test
- Test of normality
- Central Limit Theorem
]]

---

# Probability

- An .amber[event] `$E$` occurring within a particular .red[sample space] `$S$`
- .amber[Event]: Expected results
- .red[Sample space]: All possible outcomes
- Probability `$P$` is the proportion of outcomes in the event relative to the sample space
- Or mathematically, with `$|\cdot|$` counting outcomes:

`$$P(E) = \frac{|E|}{|S|}$$`

---

- Suppose we flip a fair coin 10 times, where `H` indicates
  heads and `T` indicates tails

--
- Then, our sample space:

```r
set.seed(1)
*S <- sample(c("H", "T"), 10, replace=TRUE, prob=rep(1/2, 2)) %T>% print()
```

```
##  [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"
```

---

- Let the head be our expected outcome

--
- Then, our event:

```r
*E <- S[which(S == "H")] %T>% print()
```

```
## [1] "H" "H" "H" "H" "H" "H"
```

---

- Thus, we can regard the probability of having a desired outcome as a
  .amber[relative frequency] of events in a given sample space

--
- As such:

```r
length(E) / length(S)
```

```
## [1] 0.6
```

--
- In these ten flips of a fair coin, 60% came up heads

```
##  [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"
```

---

# Determine the Probability

- Enumeration
- Tree diagram
- Resampling

---

???

- So far, we have learnt about enumeration
- In such a method, we determine a probability as a relative frequency measure

--
## Caveats in enumeration

- Larger sample space `$\to$` harder to solve
- This is more apparent in sequential problems
- Sequential problem: when you need to combine probabilities from two or more successive trials
- Example: the probability of rolling three `4`s when rolling a dice three times
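
The dice example above can be checked by brute force. This is a minimal sketch, assuming a fair six-sided dice; the names `rolls`, `is.event`, `p.enum` and `p.mult` are only illustrative. It enumerates all 216 ordered outcomes, which already hints at how quickly enumeration grows.

```r
# Enumerate every ordered outcome of three rolls of a fair six-sided dice
rolls <- expand.grid(first = 1:6, second = 1:6, third = 1:6)

# The event of interest: all three rolls show a 4
is.event <- rolls$first == 4 & rolls$second == 4 & rolls$third == 4

# Relative frequency of the event within the sample space
p.enum <- sum(is.event) / nrow(rolls)

# Shortcut using independence: multiply the single-roll probabilities
p.mult <- (1/6)^3

print(c(enumeration = p.enum, multiplication = p.mult))
```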

---

## Sample case `$\to$` .amber[the urn problem]

- We have an urn filled with .cyan[30 blue] and .red[50 red] balls
- All balls are identical except for color
- Every ball in the urn is equally likely to be drawn
- .amber[**Task:**] Take three balls .amber[without] replacement
- .amber[Question:] How high is the chance of getting three blue balls?

---

```r
    B (30/80)
   /
  /
80
  \
   \          
    R (50/80)
```

---

```r
               B (29/79)
             /
    B (30/80)
   /	     \
  /            R (50/79)
80
  \
   \          
    R (50/80)
```

---

```r
                        / B (28/78)
               B (29/79)
             /          \ R (50/78)
    B (30/80)
   /	     \
  /            R (50/79)
80
  \
   \          
    R (50/80)
```

The chance of drawing .cyan[three blue balls] is `$\frac{30}{80} \times \frac{29}{79} \times \frac{28}{78} \approx 0.0494$`
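
The blue-blue-blue branch can be verified in a few lines. This is a sketch using the urn contents from the slide; the variable names are illustrative, and the counting approach with `choose()` is an alternative route that is not part of the tree diagram itself.

```r
# Urn contents taken from the slide: 30 blue and 50 red balls
blue <- 30
red <- 50
total <- blue + red

# Walk along the blue-blue-blue branch of the tree diagram
p.tree <- (blue / total) * ((blue - 1) / (total - 1)) * ((blue - 2) / (total - 2))

# Same answer by counting: ways to pick 3 blue balls over ways to pick any 3 balls
p.count <- choose(blue, 3) / choose(total, 3)

print(round(c(tree = p.tree, counting = p.count), 4))
```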

---

```r
                        / B (28/78)
               B (29/79)
             /          \ R (50/78)
    B (30/80)
   /	     \          / B (?)
  /            R (50/79)
80                      \ R (?)
  \
   \          
    R (50/80)
```

We have learnt how to draw a tree diagram. Now, what should we fill the question marks with?

---

```r
                        / B (28/78)
               B (29/79)
             /          \ R (50/78)
    B (30/80)
   /	     \          / B (29/78)
  /            R (50/79)
80                      \ R (49/78)
  \
   \          
    R (50/80)
```

---

# Let's roll the dice :)

---

.font2[
- To learn the resampling method, we will conduct a short experiment
- This experiment relies on a simple function
- Said function will simulate an independent dice-roll
- The only parameter is `n`, indicating the number of rolls
]

```r
dice <- function(n) {
	sample(1:6, n, replace=TRUE, prob=rep(1/6, 6))
}
```

--
.font2[Let's see whether our function works...]

```r
dice(1)
```

```
## [1] 3
```

---

- So we shall roll the dice 10 times
- Let .amber[4] be our outcome of interest
- How high is the probability of having the event within 10 trials?

```r
set.seed(1)
*roll <- dice(10) %T>% print()
```

```
##  [1] 3 4 5 1 3 1 1 5 5 2
```

- How high is the probability of getting 4?

--
- Turns out, it is .amber[1/10]

--
- We have a fair dice, why is the probability not .amber[1/6]?

--
- .lime[**Hint:**] sample and population

--
- The .amber[more samples] we draw, the better they .amber[represent the population]

---

- What will we get with a different number of rolls?

--
- 100 rolls:

```r
set.seed(1); roll <- dice(100)
sum(roll==4) / length(roll)
```

```
## [1] 0.25
```

--
- 1,000 rolls:

```r
set.seed(1); roll <- dice(1000)
sum(roll==4) / length(roll)
```

```
## [1] 0.2
```

--
- 10,000 rolls:

```r
set.seed(1); roll <- dice(10000)
sum(roll==4) / length(roll)
```

```
## [1] 0.1724
```

---

- 100,000 rolls:

```r
set.seed(1); roll <- dice(100000)
sum(roll==4) / length(roll)
```

```
## [1] 0.1661
```

--
- 1,000,000 rolls:

```r
set.seed(1); roll <- dice(1000000)
sum(roll==4) / length(roll)
```

```
## [1] 0.1664
```

--
- 10,000,000 rolls:

```r
set.seed(1); roll <- dice(10000000)
sum(roll==4) / length(roll)
```

```
## [1] 0.1666
```

---

.font2[
- With more trials, we get closer to the expected probability for a fair dice
- Which is .amber[1/6], or approximately .amber[0.1667]
- The .red[error] of the estimated probability is .red[inversely proportional] to the square root of the number of trials
]

--
.font2[
Or mathematically:

`$$\epsilon = \sqrt{\frac{\hat{p} (1-\hat{p})}{N}},\ \text{where:}$$`

`$\epsilon$`: Error  
`$\hat{p}$`: Estimated probability (current trial)  
`$N$`: Number of resamples
]

---

How high is the error in our trials?

- First we need to define a function to calculate the error

```r
# Standard error of an estimated proportion p.hat based on n trials
epsilon <- function(p.hat, n) {
	sqrt((p.hat * (1-p.hat)) / n)
}
```

- Get the number of rolls and the estimated probability

```r
roll <- c(10, 100, 1000, 10000, 100000, 1000000, 10000000)
prob <- sapply(roll, function(n) {
	# Reset the seed so each run matches the earlier slides
	set.seed(1); outcome <- dice(n)
	sum(outcome==4) / length(outcome)
})
```

---

```r
df <- data.frame(roll=roll, prob=prob)
df %>% knitr::kable() %>% kableExtra::kable_styling()
```

```r
df$error <- mapply(function(p.hat, n) {
	epsilon(p.hat, n)
}, p.hat=df$prob, n=df$roll) %T>% print()
```

```
## [1] 0.0948683 0.0433013 0.0126491 0.0037773 0.0011769 0.0003724 0.0001178
```

---

Here is a nice figure to summarize the concept:

---

And another figure to see the error:

---

# Homework

.font2[
- Previously, we used a tree diagram to determine the probability in the urn problem
- Solve .amber[the urn problem] using the resampling method
- .amber[**Question:**] What is the probability of getting three .red[red balls]?
]

.font2[
Task description:
- Run the resampling with `$\{100, 200, 500, 1000, 2000, 5000\}$` trials
- Set `1` as the seed for each resampling
- Plot the probability and error
- Briefly explain your results
- You may use any programming language you are familiar with
- You just need to present the plot and explanation
]

---

# Random Variables

- All sampled random variables should be .amber[independent] of one another
- Each sampling procedure has to be .amber[identical], so that every draw follows the same probability distribution

- If they are I.I.D., we can describe the probability using:
  - Probability .amber[Mass] Function (.amber[discrete] variable)
  - Probability .amber[Density] Function (.amber[continuous] variable)

`\begin{align}
P(E=e) &= f(e) > 0: e \in S \tag{1} \\
\displaystyle \sum_{e \in S} f(e) &= 1 \tag{2} \\
P(E \in A) &= \displaystyle \sum_{e \in A} f(e): A \subset S \tag{3}
\end{align}`

???

- The function is arbitrary, it can take on any form
- There are myriad distributions
- We will look at specific examples

---

# Binomial Distribution

.font2[
- Consists of `$n$` identical trials
- Each iteration corresponds to a .amber[Bernoulli trial]
- All instances are independent
]

---

`\begin{align}
f(x) &= \binom{n}{x} p^x (1-p)^{n-x} \tag{1} \\
\binom{n}{x} &= \frac{n!}{x! (n-x)!}
\end{align}`

Or simply denoted as: `$X \sim B(n, p)$`

`\begin{align}
\mu &= n \cdot p \\
\sigma &= \sqrt{\mu \cdot (1-p)}
\end{align}`
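
The PMF and its moments can be checked against R's built-in `dbinom()`. This is a minimal sketch; `n = 10` and `p = 0.5` are illustrative values, not taken from the slide.

```r
n <- 10    # number of trials (illustrative value)
p <- 0.5   # probability of success in each trial (illustrative value)
x <- 0:n   # every possible number of successes

# PMF written out exactly as in the formula
pmf.manual <- choose(n, x) * p^x * (1 - p)^(n - x)

# PMF from base R
pmf.builtin <- dbinom(x, size = n, prob = p)

# Mean and standard deviation computed from the PMF itself
mu <- sum(x * pmf.manual)
sigma <- sqrt(sum((x - mu)^2 * pmf.manual))

print(c(mu = mu, sigma = sigma))
```

The printed values should agree with `n * p` and `sqrt(n * p * (1 - p))`.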

---

---

---

# Geometric Distribution

.font2[
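- Describes the .amber[number of trials] needed to get the first event, i.e. `$n-1$` failures followed by one success
- Follows .amber[Bernoulli trial]
- Resembles the binomial PMF with `$x=1$`, except that the single event must fall on the last trial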
]

---

`$$f(n) = P(X=n) = p (1-p)^{n-1},\ \text{with:}$$`

`$n$`: Number of trials to get an event  
`$p$`: The probability of getting an event  
Or simply denoted as `$X \sim G(p)$`

`\begin{align}
\mu &= \frac{1}{p} \\
\sigma &= \sqrt{\frac{1-p}{p^2}}
\end{align}`
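
A short check of the formula in R, as a sketch with an illustrative `p = 1/6` (rolling a 4 with a fair dice). Note that base R's `dgeom()` counts the failures before the first success, so the number of trials `n` has to be passed as `n - 1`.

```r
p <- 1/6    # probability of the event in each trial (illustrative value)
n <- 1:10   # number of trials needed to get the first event

# PMF written out exactly as in the formula
pmf.manual <- p * (1 - p)^(n - 1)

# dgeom() is parameterized by the number of failures, hence n - 1
pmf.builtin <- dgeom(n - 1, prob = p)

# Moments from the slide
mu <- 1 / p
sigma <- sqrt((1 - p) / p^2)

print(c(mu = mu, sigma = sigma))
```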

---

---

# Poisson Distribution

.font2[
- Suppose we know the rate of certain outcomes
- Poisson distribution defines the probability of an outcome happening `$x$` times
- Limited to a particular time frame (often described as observation period)
]

---

`$$f(x) = \frac{e^{-\lambda}\lambda^x}{x!},\ \text{with:}$$`

`$x$`: The number of events of interest  
`$e$`: Euler's number  
`$\lambda$`: Average number of events in one time frame

Or simply denoted as `$X \sim P(\lambda)$`

`\begin{align}
\mu &= \lambda \\
\sigma &= \sqrt{\lambda}
\end{align}`
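
The same kind of check works for the Poisson PMF. This is a sketch with an illustrative rate `lambda = 3`; the support is cut off at 50, where the remaining probability is negligible.

```r
lambda <- 3   # average number of events per time frame (illustrative value)
x <- 0:50     # truncated support; the tail beyond 50 is negligible here

# PMF written out exactly as in the formula
pmf.manual <- exp(-lambda) * lambda^x / factorial(x)

# PMF from base R
pmf.builtin <- dpois(x, lambda = lambda)

# Mean and standard deviation computed from the PMF itself
mu <- sum(x * pmf.manual)
sigma <- sqrt(sum((x - mu)^2 * pmf.manual))

print(c(mu = mu, sigma = sigma))
```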

---

---

# Uniform Distribution

.font2[
- A continuous function describing .amber[uniform] probabilities
- Hence the name: uniform distribution
- Useful in random number generators `$\to$` for randomization in clinical trials
]

???

- We finished the first part of distribution: discrete
- Now, we shall see continuous distributions and their properties

---

`$$f(x) = \frac{1}{b-a},\ a \leq x \leq b$$`

Or simply denoted as `$X \sim U(a,b)$`

`\begin{align}
\mu &= \frac{b+a}{2} \\
\sigma &= \frac{b-a}{\sqrt{12}}
\end{align}`
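
A numerical check of the uniform density, as a sketch with illustrative bounds `a = 2` and `b = 10`. It integrates the density to confirm that the area is 1, then recovers the mean `(a + b) / 2` and the variance `(b - a)^2 / 12`, whose square root is the standard deviation.

```r
a <- 2    # lower bound (illustrative value)
b <- 10   # upper bound (illustrative value)

# Density on [a, b], written out as in the formula
f <- function(x) rep(1 / (b - a), length(x))

# Total area under the density
area <- integrate(f, a, b)$value

# Mean and variance by numerical integration
mu <- integrate(function(x) x * f(x), a, b)$value
variance <- integrate(function(x) (x - mu)^2 * f(x), a, b)$value

print(c(area = area, mu = mu, variance = variance, sigma = sqrt(variance)))
```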

---

---

# Exponential Distribution

.font2[
- Closely tied to the Poisson distribution: it models the waiting time between Poisson events
- We are interested in how long a .amber[time frame is needed] to observe an event
]

---

???

- Time frame is intangible
- It is not always **time**, it could be other continuous measures
- Examples: Mileage, weight, volume, etc.

`$$f(x) = \lambda e^{-x \lambda},\ \text{with:}$$`

`$x$`: Time needed to observe an event  
`$\lambda$`: The rate for a certain event

Or simply denoted as `$X \sim Exponential(\lambda)$`

`$$\mu = \sigma = \frac{1}{\lambda}$$`
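
The density and the claim that the mean equals the standard deviation can be checked numerically. This is a sketch with an illustrative rate `lambda = 2`; the helper names are not from the slides.

```r
lambda <- 2                   # rate of the event (illustrative value)
x <- seq(0, 5, by = 0.5)      # a few waiting times to evaluate

# PDF written out exactly as in the formula
pdf.manual <- lambda * exp(-lambda * x)

# PDF from base R
pdf.builtin <- dexp(x, rate = lambda)

# Mean and variance by numerical integration over [0, Inf)
mu <- integrate(function(t) t * dexp(t, rate = lambda), 0, Inf)$value
variance <- integrate(function(t) (t - mu)^2 * dexp(t, rate = lambda), 0, Inf)$value

print(c(mu = mu, sigma = sqrt(variance)))
```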

---

---

# Gamma Distribution

- Exponential distribution is a gamma distribution with the .amber[shape] parameter fixed at 1
- Essentially, gamma distribution finds its uses in similar cases as exponential distribution
- Relies on the gamma function `$\Gamma(\alpha)$`

---

`\begin{align}
f(x) &= \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-x \beta} \\
\Gamma(\alpha) &= \displaystyle \int_0^\infty y^{\alpha -1} e^{-y}\ dy,\ with:
\end{align}`

`$\beta$`: Rate ( `$\lambda$` in exponential PDF)  
`$\alpha$`: Shape  
`$\Gamma$`: Gamma function  
`$e$`: Euler's number

Or simply denoted as `$X \sim \Gamma(\alpha, \beta)$`

If we were to assign the shape parameter `$\alpha=1$`, we get an exponential PDF.

--
.amber[Therefore,] `$Exponential(\lambda) \equiv \Gamma(1, \lambda)$`.

`\begin{align}
\mu &= \frac{\alpha}{\beta} \\
\sigma &= \frac{\sqrt{\alpha}}{\beta}
\end{align}`
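
Both the gamma PDF and its link to the exponential distribution can be checked with base R. This is a sketch; `alpha = 3`, `beta = 2` and `lambda = 2` are illustrative values.

```r
x <- seq(0.1, 5, by = 0.1)   # points at which to evaluate the densities

# Gamma PDF written out exactly as in the formula
alpha <- 3   # shape (illustrative value)
beta <- 2    # rate (illustrative value)
pdf.manual <- beta^alpha / gamma(alpha) * x^(alpha - 1) * exp(-x * beta)
pdf.builtin <- dgamma(x, shape = alpha, rate = beta)

# With shape = 1, the gamma PDF reduces to the exponential PDF
lambda <- 2
pdf.gamma1 <- dgamma(x, shape = 1, rate = lambda)
pdf.exp <- dexp(x, rate = lambda)

print(all.equal(pdf.manual, pdf.builtin))
print(all.equal(pdf.gamma1, pdf.exp))
```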

---

---

# `$\chi^2$` Distribution.amber[s]

.font2[
- .amber[Special cases] of a Gamma distribution
- Widely used in statistical .amber[inferences]
]

---

`$$f(x) = \frac{1}{\Gamma (k/2) 2^{k/2}} x^{k/2 - 1} e^{-x/2},\ with:$$`

`$k$`: Degrees of freedom  
The rest follows from the Gamma PDF with `$\alpha = k/2$` and `$\beta = 1/2$`

Or simply denoted as `$X \sim \chi^2(k)$`

`\begin{align}
\mu &= k \\
\sigma &= \sqrt{2k}
\end{align}`
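
To see the special-case relationship in practice, the `$\chi^2$` density can be compared with a gamma density of shape `k/2` and rate `1/2`. This is a sketch with an illustrative `k = 4`.

```r
k <- 4                        # degrees of freedom (illustrative value)
x <- seq(0.1, 10, by = 0.1)   # points at which to evaluate the densities

# PDF written out exactly as in the formula
pdf.manual <- 1 / (gamma(k / 2) * 2^(k / 2)) * x^(k / 2 - 1) * exp(-x / 2)

# PDF from base R, and the equivalent gamma parameterization
pdf.builtin <- dchisq(x, df = k)
pdf.gamma <- dgamma(x, shape = k / 2, rate = 1 / 2)

print(all.equal(pdf.manual, pdf.builtin))
print(all.equal(pdf.builtin, pdf.gamma))
```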

---

---

# Normal Distribution

.font2[
- Ubiquitous in real-world data
- Symmetric; `$\mu$` and `$\sigma$` completely describe the distribution
]

---

`$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp \bigg\{ -\frac12 \bigg( \frac{x-\mu}{\sigma} \bigg)^2 \bigg\},\ \text{with:}$$`

`$x \in \mathbb{R}: -\infty < x < \infty$`  
`$\mu \in \mathbb{R}: -\infty < \mu < \infty$`  
`$\sigma \in \mathbb{R}: 0 < \sigma < \infty$`

Or simply denoted as `$X \sim N(\mu, \sigma)$`
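
The PDF can be checked against `dnorm()`, which takes the standard deviation (not the variance) as its second parameter, matching the `$N(\mu, \sigma)$` notation used here. This is a sketch with the standard normal as an illustrative case.

```r
mu <- 0      # mean (illustrative value)
sigma <- 1   # standard deviation (illustrative value)
x <- seq(-3, 3, by = 0.5)

# PDF written out exactly as in the formula
pdf.manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)

# PDF from base R
pdf.builtin <- dnorm(x, mean = mu, sd = sigma)

# Probability mass within one standard deviation of the mean
within.one.sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)

print(round(within.one.sd, 4))
```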

---

---

---

---

.column.bg-main4[.vmiddle.content[
- Data type
- Probability Density Function
- .amber[Goodness of fit test]
- Test of normality
- Central Limit Theorem
]]

---

# Goodness of Fit Test

.font2[
- To determine whether your data follow a certain distribution
- Numerous methods exist; we will dig into the more popular ones
- Given correct parameters, some methods can fully describe your data
- `$H_0$`: Given data follow a certain distribution
- `$H_1$`: Given data do not follow a certain distribution
]

---

# Binomial Test

.font2[
- An adaptation of the binomial PMF
- To determine whether an observed proportion is consistent with .amber[Bernoulli] trials of a given probability
]

---

---

Remember we previously tossed a coin 10 times?

```r
set.seed(1)
*S <- sample(c("H", "T"), 10, replace=TRUE, prob=rep(1/2, 2)) %T>% print()
```

```
##  [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"
```

```r
length(E) / length(S)
```

```
## [1] 0.6
```

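If the tosses are Bernoulli trials with `$p = 0.5$`, the number of heads follows
`$B(10, 0.5)$`. The probability of observing exactly six heads is:

`$$P(X=6) = \binom{10}{6}0.5^6(1-0.5)^4 \approx 0.205$$`

The test then asks whether outcomes at least this unlikely are rare enough to reject `$H_0$`.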

---

Luckily, we .amber[do not] need to compute it by hand 
--
(yet)

```r
binom.test(x=6, n=10, p=0.5)
```

```
## 
## 	Exact binomial test
## 
## data:  6 and 10
## number of successes = 6, number of trials = 10, p-value = 0.8
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2624 0.8784
## sample estimates:
## probability of success 
##                    0.6
```

The p-value is large, so we cannot reject `$H_0$`: our coin tosses are
consistent with Bernoulli trials with `$p = 0.5$`.
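
For the curious, the p-value can be reproduced by hand. This is a sketch of the two-sided exact test: it sums the probabilities of every outcome that is no more likely than the observed one. The small tolerance factor `1 + 1e-7` is there to guard against floating-point ties and is treated here as an assumption about how ties are handled.

```r
n <- 10    # number of tosses
x <- 6     # observed number of heads
p <- 0.5   # probability of heads under H0

# Probability of the observed outcome, P(X = 6)
p.x <- dbinom(x, size = n, prob = p)

# Probabilities of all possible outcomes 0, 1, ..., n
probs <- dbinom(0:n, size = n, prob = p)

# Two-sided p-value: total probability of outcomes at most as likely as x
p.value <- sum(probs[probs <= p.x * (1 + 1e-7)])

print(round(c(p.x = p.x, p.value = p.value), 4))
```

Here every outcome except exactly five heads is at most as likely as six heads, so the p-value equals `1 - P(X = 5)`.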

---

# Kolmogorov-Smirnov Test

.font2[
- This test can assess the fit to various distributions
- Works as a non-parametric test
- Fairly robust; for normal distributions it is second only to the .amber[Anderson-Darling] test
]

???

Robustness based on yielded power

---

Let `$X \sim Exponential(2): n = 100$`

```r
set.seed(1); X <- rexp(n=100, rate=2)
```

By supplying the `$\lambda$` parameter, Kolmogorov-Smirnov can compute its goodness of fit

```r
ks.result <- ks.test(X, pexp, rate=2)
```

---

```r
print(ks.result)
```

```
## 
## 	One-sample Kolmogorov-Smirnov test
## 
## data:  X
## D = 0.084, p-value = 0.5
## alternative hypothesis: two-sided
```
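
The statistic `D` is the largest vertical gap between the empirical CDF of the sample and the theoretical CDF. A sketch of that computation, reusing the same seed and parameters; the names `F.theo`, `D.plus` and `D.minus` are illustrative.

```r
set.seed(1); X <- rexp(n = 100, rate = 2)
n <- length(X)

# Theoretical CDF evaluated at the sorted observations
F.theo <- pexp(sort(X), rate = 2)

# Gap just after each jump of the empirical CDF (ECDF above the theory)
D.plus <- max(seq_len(n) / n - F.theo)

# Gap just before each jump of the empirical CDF (theory above the ECDF)
D.minus <- max(F.theo - (seq_len(n) - 1) / n)

# The test statistic is the larger of the two
D <- max(D.plus, D.minus)
print(D)
```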

---

# Visual Examination

---

.font2[
- Okay, doing math is cool and all
- But in a large sample, even a small deviation will result in `$H_0$` rejection
- Which means the previously mentioned tests are of little use there!
- We can rely on some visual cues to determine the distribution though
]

---

Hey, that's a good start! 
--
This does not clearly suggest a specific distribution though :(

---

Quantile-Quantile Plot (QQ Plot) can give a better visual cue :)
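
One way such a plot could be drawn by hand, as a sketch that reuses the exponential sample from the Kolmogorov-Smirnov example. Points close to the dashed identity line suggest that the sample follows the reference distribution. The name `theo` and the plot labels are illustrative.

```r
set.seed(1); X <- rexp(n = 100, rate = 2)

# Theoretical quantiles of Exponential(2) at evenly spread probabilities
theo <- qexp(ppoints(length(X)), rate = 2)

# Sorted sample against theoretical quantiles
plot(theo, sort(X),
     xlab = "Theoretical quantiles", ylab = "Sample quantiles",
     main = "QQ plot against Exponential(2)")

# Identity line: a perfect fit would fall exactly on it
abline(a = 0, b = 1, lty = 2)
```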

---

.column.bg-main4[.vmiddle.content[
- Data type
- Probability Density Function
- Goodness of fit test
- .amber[Test of normality]
- Central Limit Theorem
]]

---

# Test of Normality

.font2[
- Practically a subset of goodness of fit test
- Some are more appropriate under certain circumstances
- We shall look at the widely used ones
- `$H_0$`: Sample follows the normal distribution
- `$H_1$`: Sample does not follow the normal distribution
]

---

# Shapiro-Wilk Test

.font2[
- A well-established test to assess normality
- Can tolerate skewness to a certain degree
- Implementation in `R`: sample size between 3 and 5000
]

---

# Anderson-Darling Test

.font2[
- Less well-known compared to Shapiro-Wilk
- Gives more weight to the tails
- Implementation in `R`: sample size must be greater than 7
]

---

# Demonstration

Let `$X \sim N(0, 1): n=100$`

```r
set.seed(1)
X <- rnorm(n=100, mean=0, sd=1)
```

---

## Shapiro-Wilk

```r
shapiro.test(X)
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  X
## W = 1, p-value = 1
```

---

## Anderson-Darling

```r
nortest::ad.test(X)
```

```
## 
## 	Anderson-Darling normality test
## 
## data:  X
## A = 0.16, p-value = 0.9
```

---

# Visual Examination

---

# `$\chi^2$` and Normal Distribution

.font2[
- Square data drawn from a standard normal distribution
- The result follows a `$\chi^2$` distribution with 1 degree of freedom
]

---

For demonstration purposes, we will re-use `$X \sim N(0, 1): n=100$`

```r
set.seed(1)
X <- rnorm(n=100, mean=0, sd=1)
```

We previously applied the Shapiro-Wilk and Anderson-Darling tests to `$X$`, and neither rejected normality.

Now, we will raise it to the power of two

```r
X2 <- X^2
```

---

Does it still follow a normal distribution?

```r
shapiro.test(X2)
```

```
## 
## 	Shapiro-Wilk normality test
## 
## data:  X2
## W = 0.7, p-value = 5e-13
```

---

# Visual Examination

---

It does not follow a normal distribution at all. 
--
Does it follow the `$\chi^2$` distribution though?

```r
ks.test(X2, pchisq, df=1)
```

```
## 
## 	One-sample Kolmogorov-Smirnov test
## 
## data:  X2
## D = 0.1, p-value = 0.2
## alternative hypothesis: two-sided
```
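
With a larger sample the agreement becomes easier to see. A sketch, assuming 100,000 draws (an illustrative choice): the squared values should have a mean near 1 and a variance near 2, matching `$\mu = k$` and `$\sigma^2 = 2k$` for `$k = 1$`.

```r
set.seed(1)
Z <- rnorm(n = 100000, mean = 0, sd = 1)
Z2 <- Z^2

# Moments of the squared sample; theory gives mean 1 and variance 2
print(round(c(mean = mean(Z2), variance = var(Z2)), 3))

# Empirical quantiles next to the chi-squared(1) quantiles
probs <- c(0.25, 0.5, 0.75, 0.95)
print(rbind(empirical = quantile(Z2, probs),
            theoretical = qchisq(probs, df = 1)))
```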

---

# Visual Examination

---

.column.bg-main4[.vmiddle.content[
- Data type
- Probability Density Function
- Goodness of fit test
- Test of normality
- .amber[Central Limit Theorem]
]]

---

# Central Limit Theorem

---

???

`$\xrightarrow{d}$` is a convergence of random variables

.font2[
- So far, we have learnt about sampling from distributions
- We are also able to compute the mean and standard deviation based on their parameters
- It so happens that the sample mean follows a normal distribution
- The .amber[central limit theorem] describes this occurrence
- This rule applies to both .amber[discrete] and .amber[continuous] distributions
]
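
One common way to state the theorem, as a sketch assuming I.I.D. samples `$X_1, \dots, X_n$` with a finite mean `$\mu$` and standard deviation `$\sigma$`, where `$\bar{X}_n$` is the sample mean:

`$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1)$$`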

---

.font2[
- It comes with a trade-off though
- CLT requires a sufficiently large `$n$`
- The required `$n$` depends on how skewed the data are
- More skewed? More `$n$` required.
]

---

How do we determine `$n$`?
--
`$\to$` Simulation

--
1. Choose any distribution

--
1. Generate `$n$` random numbers using specified parameters

--
1. Compute the mean and variance from those parameters `$\to$` Use them to define the normal distribution expected for the sample mean

--
1. Repeat step 2 with the same parameters

--
1. Conduct the simulation an arbitrary number of times (e.g. for convenience, 1000)

--
1. Calculate the mean of each generated sample `$\to$` Make a histogram and compare it with step 3

--
1. Does not fit normal distribution? `$\to$` Increase `$n$`
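
The steps above can be sketched as follows. The choice of an exponential distribution with `rate = 2`, the sample size `n = 30` and the 1000 repetitions are all illustrative assumptions; any distribution with a finite variance would do.

```r
set.seed(1)
rate <- 2      # parameter of the chosen (skewed) distribution
n <- 30        # sample size per simulation
reps <- 1000   # number of simulations

# Steps 2, 4 and 5: draw `reps` samples of size n and keep each sample mean
means <- replicate(reps, mean(rexp(n, rate = rate)))

# Step 3: mean and standard deviation that the CLT predicts for the sample mean
mu <- 1 / rate
se <- (1 / rate) / sqrt(n)

# Step 6: histogram of sample means against the predicted normal curve
hist(means, breaks = 30, freq = FALSE,
     main = "Sample means, n = 30", xlab = "Sample mean")
curve(dnorm(x, mean = mu, sd = se), add = TRUE, lwd = 2)

print(round(c(mean = mean(means), sd = sd(means), predicted.sd = se), 4))
```

If the histogram still looks skewed next to the curve, increase `n` and run it again.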

---

## Why should you care?

.font2[
- In a research setting, you may find differing average values
- It could happen despite following the exact procedure
- And it is frustrating!
- Knowing CLT, you can show the difference is indeed within expectation
- Besides, the equation above looks cool ;)
]

---

## Final Excerpts:

.font2[
- CLT describes the tendency of a .amber[mean] `$\bar{x}$` to follow a normal distribution
- Requires a sufficient .amber[sample size] `$n$`
- A .amber[simple simulation] can illustrate the theorem
]

Notes for current slide

Notes for next slide

Data: Type and Distribution

Aly Lamuri
Indonesia Medical Education and Research Institute

1 / 26

Overview

Data type
Probability Density Function
Goodness of fit test
Test of normality
Central Limit Theorem

1 / 26

Data TypeCategorical
Numeric

1 / 26

Numerous conventions in describing data
Understanding the nature behind categorical and numeric is more important
Examples on established convention:
- Nominal, ordinal, interval, ratio
- Categorical, discrete, continuous

Nominal

Data Type

Categorical
Numeric

1 / 26

Other examples:

Types of car
Brands
Netflix shows

Ordinal

Data Type

Categorical
Numeric

1 / 26

Other examples:

Disease severity
Qualitative measure: bad $\to$ good

Discrete (clue: countable)

Data Type

Categorical
Numeric

1 / 26

Continuous (clue: measurable)

Data Type

Categorical
Numeric

1 / 26

Continuous:

Interval
Ratio

Continuous Data
Interval
Has a fixed distance
Arithmetic: addition and subtraction
Examples:Likert scale
Temperature in other scales


Ratio
Has an absolute zero
Infinitesimal measure
All arithmetic rules are applicable
Examples:Temperature in Kelvin
Weight


2 / 26

How about Likert-type item?Usually uses a distinctive scale out of 4, 5, 7 and 10 units
Some regards Likert-type question as discrete counts
While for others, a continuous interval
Context-dependant

3 / 26

How about Likert-type item?

Usually uses a distinctive scale out of 4, 5, 7 and 10 units
Some regards Likert-type question as discrete counts
While for others, a continuous interval
Context-dependant

3 / 26

Checkpoint! What type of data do we have?3 / 26

Checkpoint! What type of data do we have?We were conducting a survey in three universities.
3 / 26

Checkpoint! What type of data do we have?We were conducting a survey in three universities.
From each university, we sampled the first, second, penultimate and final year students in a four-year programme. 
3 / 26

Checkpoint! What type of data do we have?We were conducting a survey in three universities.
From each university, we sampled the first, second, penultimate and final year students in a four-year programme. 
We nicely asked them to indicate their level of burnout using a Likert-type self-report inventory.
3 / 26

Checkpoint! What type of data do we have?We were conducting a survey in three universities.
From each university, we sampled the first, second, penultimate and final year students in a four-year programme. 
We nicely asked them to indicate their level of burnout using a Likert-type self-report inventory.
We also kindly measured their blood cortisol level.
3 / 26

Overview

Data type
Probability Density Function
Goodness of fit test
Test of normality
Central Limit Theorem

3 / 26

Probability

An event $E$ occurring within a particular sample space $S$
Event: Expected results
Sample space: All possible outcomes
Probability $P$ is a proportion of event divided by its sample space
Or mathematically:

$P (E = e) = \frac{E}{S}$

Suppose we have a fair coin and doing a flip 10 times, where H indicates the head and T indicates the tail

4 / 26

Probability

An event $E$ occurring within a particular sample space $S$
Event: Expected results
Sample space: All possible outcomes
Probability $P$ is a proportion of event divided by its sample space
Or mathematically:

$P (E = e) = \frac{E}{S}$

Suppose we have a fair coin and doing a flip 10 times, where H indicates the head and T indicates the tail
Then, our sample space:

set.seed(1)
S <- sample(c("H", "T"), 10, replace=TRUE, prob=rep(1/2, 2)) %T>% print()

##  [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"

4 / 26

Probability

An event $E$ occurring within a particular sample space $S$
Event: Expected results
Sample space: All possible outcomes
Probability $P$ is a proportion of event divided by its sample space
Or mathematically:

$P (E = e) = \frac{E}{S}$

Let the head be our expected outcome

4 / 26

Probability

An event $E$ occurring within a particular sample space $S$
Event: Expected results
Sample space: All possible outcomes
Probability $P$ is a proportion of event divided by its sample space
Or mathematically:

$P (E = e) = \frac{E}{S}$

Let the head be our expected outcome
Then, our event:

E <- S[which(S == "H")] %T>% print()

## [1] "H" "H" "H" "H" "H" "H"

4 / 26

Probability

An event $E$ occurring within a particular sample space $S$
Event: Expected results
Sample space: All possible outcomes
Probability $P$ is a proportion of event divided by its sample space
Or mathematically:

$P (E = e) = \frac{E}{S}$

Thus, we can regard the probability of having a desired outcome as a relative frequency of events in a given sample space

4 / 26

Probability

An event $E$ occurring within a particular sample space $S$
Event: Expected results
Sample space: All possible outcomes
Probability $P$ is a proportion of event divided by its sample space
Or mathematically:

$P (E = e) = \frac{E}{S}$

Thus, we can regard the probability of having a desired outcome as a relative frequency of events in a given sample space
As such:

length(E) / length(S)

## [1] 0.6

4 / 26

Probability

An event $E$ occurring within a particular sample space $S$
Event: Expected results
Sample space: All possible outcomes
Probability $P$ is a proportion of event divided by its sample space
Or mathematically:

$P (E = e) = \frac{E}{S}$

Thus, we can regard the probability of having a desired outcome as a relative frequency of events in a given sample space
As such:

length(E) / length(S)

## [1] 0.6

Ten flips using a fair coin resulted in 60% chance of having heads

##  [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"

4 / 26

Determine the ProbabilityEnumeration
Tree diagram
Resampling

5 / 26

So far, we have learnt about enumeration
In such a method, we determine a probability as a relative frequency measure

Determine the ProbabilityEnumeration
Tree diagram
Resampling

Caveats in enumerationHigher sample space →→ harder to solve
It is more apparent with sequential problem
Sequential problem: when you need to calculate probability from two different instances
Example: the probability of having three 4 while rolling a dice three times
5 / 26

So far, we have learnt about enumeration
In such a method, we determine a probability as a relative frequency measure

Determine the Probability

Enumeration
Tree diagram
Resampling

Caveats in enumeration

Higher sample space $\to$ harder to solve
It is more apparent with sequential problem
Sequential problem: when you need to calculate probability from two different instances
Example: the probability of having three 4 while rolling a dice three times

Tree diagram is available to solve a more complex probability problem

5 / 26

So far, we have learnt about enumeration
In such a method, we determine a probability as a relative frequency measure

Determine the ProbabilityEnumeration
Tree diagram
Resampling

Sample case →→ the urn problemWe have an urn filled with 30 blue and 50 red balls
All balls are identical except for color
In the urn, all balls have an equal distribution
Task: Take three balls without replacement
Question: How high is the chance of getting three blue balls?
5 / 26

Determine the Probability

Enumeration
Tree diagram
Resampling

    B (30/80)
   /
  /
80
  \
   \          
    R (50/80)

5 / 26

Determine the Probability

Enumeration
Tree diagram
Resampling

               B (29/79)
             /
    B (30/80)
   /         \
  /            R (50/79)
80
  \
   \          
    R (50/80)

5 / 26

Determine the Probability

Enumeration
Tree diagram
Resampling

                        / B (28/78)
               B (29/79)
             /          \ R (50/78)
    B (30/80)
   /         \
  /            R (50/79)
80
  \
   \          
    R (50/80)

5 / 26

Determine the Probability

Enumeration
Tree diagram
Resampling

                        / B (28/78)
               B (29/79)
             /          \ R (50/78)
    B (30/80)
   /         \
  /            R (50/79)
80
  \
   \          
    R (50/80)

The chance for having three blue balls is 0.0494

5 / 26

Determine the Probability

Enumeration
Tree diagram
Resampling

                        / B (28/78)
               B (29/79)
             /          \ R (50/78)
    B (30/80)
   /         \          / B (?)
  /            R (50/79)
80                      \ R (?)
  \
   \          
    R (50/80)

We have learnt how to draw a tree diagram. Now, what should we fill the question mark with?

5 / 26

Determine the Probability

Enumeration
Tree diagram
Resampling

                        / B (28/78)
               B (29/79)
             /          \ R (50/78)
    B (30/80)
   /         \          / B (29/78)
  /            R (50/79)
80                      \ R (49/78)
  \
   \          
    R (50/80)

5 / 26

Let's roll the dice :)

To learn resampling method, we will conduct a short experiment
This experiment relies on a simple function
Said function will simulate an independent dice-roll
The only parameter is n, indicating the number of roll

dice <- function(n) {
    sample(1:6, n, replace=TRUE, prob=rep(1/6, 6))
}

6 / 26

Let's roll the dice :)

To learn resampling method, we will conduct a short experiment
This experiment relies on a simple function
Said function will simulate an independent dice-roll
The only parameter is n, indicating the number of roll

dice <- function(n) {
    sample(1:6, n, replace=TRUE, prob=rep(1/6, 6))
}

Let's see whether our function work...

6 / 26

Let's roll the dice :)

To learn resampling method, we will conduct a short experiment
This experiment relies on a simple function
Said function will simulate an independent dice-roll
The only parameter is n, indicating the number of roll

dice <- function(n) {
    sample(1:6, n, replace=TRUE, prob=rep(1/6, 6))
}

Let's see whether our function work...

dice(1)

## [1] 3

It does!

So we shall roll the die 10 times
Let 4 be our outcome of interest
What is the probability of getting this outcome within 10 trials?

library(magrittr)  # provides the tee pipe %T>%
set.seed(1)
roll <- dice(10) %T>% print()

##  [1] 3 4 5 1 3 1 1 5 5 2

How high is the probability of getting 4?
It turns out to be 1/10
We have a fair die, so why is the probability not 1/6?
Hint: sample and population
The more samples we draw, the better they represent the population

What will we get with different numbers of rolls?
100 rolls:

set.seed(1); roll <- dice(100)
sum(roll==4) / length(roll)

## [1] 0.25

1,000 rolls:

set.seed(1); roll <- dice(1000)
sum(roll==4) / length(roll)

## [1] 0.2

10,000 rolls:

set.seed(1); roll <- dice(10000)
sum(roll==4) / length(roll)

## [1] 0.1724

100,000 rolls:

set.seed(1); roll <- dice(100000)
sum(roll==4) / length(roll)

## [1] 0.1661

1,000,000 rolls:

set.seed(1); roll <- dice(1000000)
sum(roll==4) / length(roll)

## [1] 0.1664

10,000,000 rolls:

set.seed(1); roll <- dice(10000000)
sum(roll==4) / length(roll)

## [1] 0.1666

With more trials, we get closer to the expected probability for a fair die
Which is 1/6, or approximately 0.1667
The error of the estimated probability is inversely proportional to the square root of the number of trials

Or mathematically:

$ϵ = \sqrt{\frac{\hat{p} (1 - \hat{p})}{N}}$, where:

$ϵ$ : Error
$\hat{p}$ : Estimated probability (current trial)
$N$ : Number of resampled trials

How high is the error in our trials?

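First, we define a function to calculate the error

# Standard error of an estimated probability p.hat from n trials
epsilon <- function(p.hat, n) {
    sqrt(p.hat * (1 - p.hat) / n)
}

Then we get the number of rolls and the estimated probability for each

roll <- c(10, 100, 1000, 10000, 100000, 1000000, 10000000)
prob <- sapply(roll, function(n) {
    # Reset the seed so that every run is reproducible
    set.seed(1); outcome <- dice(n)
    sum(outcome==4) / length(outcome)
})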

df <- data.frame(roll=roll, prob=prob)
df %>% knitr::kable() %>% kableExtra::kable_styling()

roll	prob
1e+01	0.1000
1e+02	0.2500
1e+03	0.2000
1e+04	0.1724
1e+05	0.1661
1e+06	0.1664
1e+07	0.1666

df$error <- epsilon(df$prob, df$roll) %T>% print()

## [1] 0.0948683 0.0433013 0.0126491 0.0037773 0.0011769 0.0003724 0.0001178

Here is a nice figure to summarize the concept:

And another figure to see the error:

Homework

Previously, we used a tree diagram to determine the probability in the urn problem
Solve the urn problem using the resampling method
Question: What is the probability of getting three red balls?

Task description:

Run the experiment with $\{100, 200, 500, 1000, 2000, 5000\}$ trials
Set 1 as the seed for each resampling
Plot the probability and error
Briefly explain your results
You may use any programming language you are familiar with
You only need to present the plot and explanation

Random Variables

Independent vs Identical? $\to$ I.I.D

All sampled random variables should be independent of one another
Each sampling procedure has to be identical, so that every draw follows the same probability distribution

Considering I.I.D, can we do a better probability estimation?

If they are I.I.D, we can approximate the probability using:
- Probability Mass Function (discrete variable)
- Probability Density Function (continuous variable)

In math, please?

$\begin{aligned} (1) \quad & P (E = e) = f (e) > 0 : e \in S \\ (2) \quad & \sum_{e \in S} f (e) = 1 \\ (3) \quad & P (E \in A) = \sum_{e \in A} f (e) : A \subset S \end{aligned}$
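
As a minimal sketch, the three properties can be checked numerically for the fair die used earlier. The names `S`, `f` and `A` are illustrative choices, not objects from the original slides:

```r
# Sample space of a fair six-sided die
S <- 1:6

# PMF: every outcome in S has the same probability (fair die assumed)
f <- function(e) ifelse(e %in% S, 1/6, 0)

# (1) Every outcome in S has a positive probability
all(f(S) > 0)

# (2) The probabilities over S sum to one
sum(f(S))

# (3) P(E in A) is the sum of f over A, here A = {2, 4, 6}
A <- c(2, 4, 6)
sum(f(A))
```

Any other discrete distribution could be plugged in by swapping the body of `f`.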

The function $f$ is arbitrary; it can take on any form
There are myriad distributions
We will look at specific examples

Binomial Distribution

Consists of $n$ identical trials
Each trial corresponds to a Bernoulli trial
All trials are independent

$\begin{aligned} f (x) & = \binom{n}{x} p^{x} (1 - p)^{n - x} \\ \binom{n}{x} & = \frac{n!}{x! (n - x)!} \end{aligned}$

Or simply denoted as: $X \sim B (n, p)$

$\begin{aligned} μ & = n \cdot p \\ σ & = \sqrt{μ \cdot (1 - p)} \end{aligned}$
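
A short sketch comparing the formula with base R's `dbinom()`. The values of `n` and `p` are arbitrary illustrations:

```r
n <- 10   # number of trials (illustrative value)
p <- 1/6  # probability of success in each trial (illustrative value)
x <- 0:n  # all possible numbers of successes

# PMF written out from the formula
pmf.manual <- choose(n, x) * p^x * (1 - p)^(n - x)

# Same PMF from base R
pmf.builtin <- dbinom(x, size=n, prob=p)
all.equal(pmf.manual, pmf.builtin)

# Mean and standard deviation recovered from the PMF
mu <- sum(x * pmf.builtin)
sigma <- sqrt(sum((x - mu)^2 * pmf.builtin))
c(mu=mu, sigma=sigma)   # compare with n * p and sqrt(n * p * (1 - p))
```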

Geometric Distribution

Describes the number of trials needed to get the first event
Each trial is a Bernoulli trial
Built on the same Bernoulli trials as the binomial distribution, with a single event ($x = 1$) occurring on the last trial

$f (n) = P (X = n) = p (1 - p)^{n - 1}$, with:

$n$ : Number of trials to get an event
$p$ : The probability of getting an event

Or simply denoted as $X \sim G (p)$

$\begin{aligned} μ & = \frac{1}{p} \\ σ & = \sqrt{\frac{1 - p}{p^{2}}} \end{aligned}$
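
A brief sketch of the same PMF in R. Note that base R's `dgeom()` and `rgeom()` count the failures before the first event rather than the number of trials, so they are shifted by one relative to the formula above. The value of `p` is illustrative:

```r
p <- 1/6   # probability of the event (illustrative value)
n <- 1:10  # trial on which the first event occurs

# PMF written out from the formula
pmf.manual <- p * (1 - p)^(n - 1)

# dgeom() counts failures before the first event, hence n - 1
pmf.builtin <- dgeom(n - 1, prob=p)
all.equal(pmf.manual, pmf.builtin)

# Simulated check of the mean: rgeom() also returns failures, so add 1
set.seed(1)
m <- mean(rgeom(100000, prob=p) + 1)
m   # should be close to 1 / p
```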

Poisson Distribution

Suppose we know the rate of certain outcomes
The Poisson distribution gives the probability of an outcome happening $x$ times
Limited to a particular time frame (often described as the observation period)

$f (x) = \frac{e^{- λ} λ^{x}}{x!}$, with:

$x$ : The number of events observed
$e$ : Euler's number
$λ$ : Average number of events in one time frame

Or simply denoted as $X \sim P (λ)$

$\begin{aligned} μ & = λ \\ σ & = \sqrt{λ} \end{aligned}$
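
As a quick sketch, the PMF can be compared with `dpois()`, and the mean and standard deviation checked on a simulated sample. The rate `lambda` and the name `sim` are illustrative:

```r
lambda <- 3  # average number of events per time frame (illustrative value)
x <- 0:10    # number of events

# PMF written out from the formula
pmf.manual <- exp(-lambda) * lambda^x / factorial(x)
pmf.builtin <- dpois(x, lambda=lambda)
all.equal(pmf.manual, pmf.builtin)

# Mean and standard deviation from a simulated sample
set.seed(1)
sim <- rpois(100000, lambda=lambda)
c(mean=mean(sim), sd=sd(sim))   # compare with lambda and sqrt(lambda)
```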

We have finished the first group of distributions: discrete
Now, we shall see continuous distributions and their properties

Uniform Distribution

A continuous function describing uniform probabilities
Hence the name: uniform distribution
Useful in random number generators $\to$ for randomization in clinical trials

$f (x) = \frac{1}{b - a} : a \leq x \leq b$

Or simply denoted as $X \sim U (a, b)$

$\begin{aligned} μ & = \frac{b + a}{2} \\ σ & = \frac{b - a}{\sqrt{12}} \end{aligned}$
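
A small sketch with base R's `dunif()` and `runif()`. The bounds are arbitrary, and the simulated standard deviation is compared with $(b - a) / \sqrt{12}$, the standard deviation of a uniform variable:

```r
a <- 2; b <- 5   # lower and upper bounds (illustrative values)

# The density is constant at 1 / (b - a) inside [a, b] and zero outside
dens <- dunif(c(1, 3, 4.5, 6), min=a, max=b)
dens

# Mean and standard deviation from a simulated sample
set.seed(1)
sim <- runif(100000, min=a, max=b)
c(mean=mean(sim), sd=sd(sim))   # compare with (a + b) / 2 and (b - a) / sqrt(12)
```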

The time frame is an abstract notion
It is not always time; it could be another continuous measure
Examples: Mileage, weight, volume, etc.

Exponential Distribution

Closely related to the Poisson distribution: it describes the waiting time between events
We are interested in how long a time frame is needed to observe an event

$f (x) = λ e^{- x λ}$, with:

$x$ : Time needed to observe an event
$λ$ : The rate for a certain event

Or simply denoted as $X \sim \text{Exponential} (λ)$

$μ = σ = \frac{1}{λ}$
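
A minimal sketch comparing the formula with `dexp()`, followed by a simulated check that the mean and standard deviation coincide. The rate and the name `sim` are illustrative:

```r
lambda <- 2             # rate of the event (illustrative value)
x <- seq(0, 3, by=0.5)  # waiting times

# PDF written out from the formula
pdf.manual <- lambda * exp(-x * lambda)
pdf.builtin <- dexp(x, rate=lambda)
all.equal(pdf.manual, pdf.builtin)

# Mean and standard deviation from a simulated sample
set.seed(1)
sim <- rexp(100000, rate=lambda)
c(mean=mean(sim), sd=sd(sim))   # both should be close to 1 / lambda
```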

Gamma Distribution

Exponential distribution is a gamma distribution with the shape parameter fixed at 1
Essentially, the gamma distribution finds its uses in similar cases as the exponential distribution
Relies on the gamma function $Γ (α)$

$\begin{aligned} f (x) & = \frac{β^{α}}{Γ (α)} x^{α - 1} e^{- x β} \\ Γ (α) & = \int_{0}^{\infty} y^{α - 1} e^{- y} d y \end{aligned}$

with:

$β$ : Rate ( $λ$ in exponential PDF)
$α$ : Shape
$Γ$ : Gamma function
$e$ : Euler's number

Or simply denoted as $X \sim Γ (α, β)$

If we were to assign the shape parameter $α = 1$, we get an exponential PDF. Therefore, $\text{Exponential} (λ) = Γ (1, λ)$.

$\begin{aligned} μ & = \frac{α}{β} \\ σ & = \frac{\sqrt{α}}{β} \end{aligned}$
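
The link to the exponential distribution can be sketched numerically with base R's `dgamma()` and `dexp()`. The shape and rate values are arbitrary:

```r
x <- seq(0.1, 5, by=0.1)
alpha <- 3; beta <- 2   # shape and rate (illustrative values)

# PDF written out from the formula, using base R's gamma() function
pdf.manual <- beta^alpha / gamma(alpha) * x^(alpha - 1) * exp(-x * beta)
all.equal(pdf.manual, dgamma(x, shape=alpha, rate=beta))

# With shape = 1 the gamma PDF reduces to the exponential PDF
all.equal(dgamma(x, shape=1, rate=beta), dexp(x, rate=beta))
```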

$χ^{2}$ Distributions

Special cases of the gamma distribution, with $α = k / 2$ and $β = 1 / 2$
Widely used in statistical inference

$f (x) = \frac{1}{Γ (k / 2) 2^{k / 2}} x^{k / 2 - 1} e^{- x / 2}$, with:

$k$ : Degrees of freedom
The remaining terms carry over from the gamma PDF

Or simply denoted as $X \sim χ^{2} (k)$

$\begin{aligned} μ & = k \\ σ & = \sqrt{2 k} \end{aligned}$
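
A short sketch showing that the $χ^{2}$ density matches a gamma density with shape $k / 2$ and rate $1 / 2$. The degrees of freedom chosen here are arbitrary:

```r
k <- 4                      # degrees of freedom (illustrative value)
x <- seq(0.1, 10, by=0.1)

# PDF written out from the formula
pdf.manual <- x^(k/2 - 1) * exp(-x/2) / (gamma(k/2) * 2^(k/2))
all.equal(pdf.manual, dchisq(x, df=k))

# Same density as a gamma distribution with shape k/2 and rate 1/2
all.equal(dchisq(x, df=k), dgamma(x, shape=k/2, rate=1/2))
```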

Relation to normal distribution?

Normal Distribution

Ubiquitous in real-world data
Symmetric, and completely described by $μ$ and $σ$

$f (x) = \frac{1}{σ \sqrt{2 π}} \exp \left\{ - \frac{1}{2} \left( \frac{x - μ}{σ} \right)^{2} \right\}$, with:

$x \in \mathbb{R} : - \infty < x < \infty$
$μ \in \mathbb{R} : - \infty < μ < \infty$
$σ \in \mathbb{R} : 0 < σ < \infty$

Or simply denoted as $X \sim N (μ, σ)$
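
As a minimal sketch, the density can be written out and compared with `dnorm()`, and the symmetry checked directly. The parameter values and the name `within.one.sd` are illustrative:

```r
mu <- 0; sigma <- 1          # illustrative parameters
x <- seq(-3, 3, by=0.5)

# PDF written out from the formula
pdf.manual <- 1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
all.equal(pdf.manual, dnorm(x, mean=mu, sd=sigma))

# Symmetry around mu
all.equal(dnorm(mu - x, mu, sigma), dnorm(mu + x, mu, sigma))

# Probability of falling within one standard deviation of the mean
within.one.sd <- pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)
within.one.sd   # about 0.68
```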

Overview

Data type
Probability Density Function
Goodness of fit test
Test of normality
Central Limit Theorem

Goodness of Fit Test

To determine whether your data follow a certain distribution
Numerous methods exist; we will dig into the more popular ones
Given correct parameters, some methods can fully describe your data
$H_{0}$: Given data follow a certain distribution
$H_{1}$: Given data do not follow a certain distribution

Binomial Test

An adaptation of the binomial PMF
Determines whether the observed proportion is consistent with a Bernoulli trial of a given probability

$\Pr (X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}$

Remember we previously tossed a coin 10 times?

set.seed(1)
S <- sample(c("H", "T"), 10, replace=TRUE, prob=rep(1/2, 2)) %T>% print()

##  [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"

E <- S[S == "H"]  # assumed definition: the event of interest is heads
length(E) / length(S)

## [1] 0.6

If the tosses represent a fair Bernoulli trial, the observed $X = 6$ should be plausible enough that we cannot reject the $H_{0}$. The probability of this exact outcome is:

$P (X = 6) = \binom{10}{6} {0.5}^{6} (1 - 0.5)^{4}$

Luckily, we do not need to compute it by hand (yet)

binom.test(x=6, n=10, p=0.5)

## 
##     Exact binomial test
## 
## data:  6 and 10
## number of successes = 6, number of trials = 10, p-value = 0.8
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.2624 0.8784
## sample estimates:
## probability of success 
##                    0.6

Interpreting the p-value, we cannot reject the $H_{0}$, so our coin tosses are consistent with a Bernoulli trial after all.
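
For the curious, here is a rough sketch of how the two-sided p-value can be rebuilt from the binomial PMF: add up the probability of every outcome that is no more likely than the observed one. The names `p.obs`, `pmf` and `p.value` are illustrative, and the small tolerance is an assumption added to guard against floating-point ties:

```r
n <- 10; k <- 6; p <- 0.5

# Probability of exactly 6 heads in 10 fair tosses
p.obs <- dbinom(k, size=n, prob=p)
p.obs

# Two-sided p-value: total probability of all outcomes that are
# no more likely than the observed one
pmf <- dbinom(0:n, size=n, prob=p)
p.value <- sum(pmf[pmf <= p.obs * (1 + 1e-7)])
p.value

# Should agree with the p-value stored in the binom.test() result
binom.test(x=k, n=n, p=p)$p.value
```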

Kolmogorov-Smirnov Test

This test can assess the fit against various distributions
Works as a non-parametric test
Fairly robust, second only to the Anderson-Darling test for the normal distribution

Let $X \sim \text{Exponential} (2) : n = 100$

set.seed(1); X <- rexp(n=100, rate=2)

By supplying the $λ$ parameter, Kolmogorov-Smirnov can compute the goodness of fit

ks.result <- ks.test(X, pexp, rate=2)

print(ks.result)

## 
##     One-sample Kolmogorov-Smirnov test
## 
## data:  X
## D = 0.084, p-value = 0.5
## alternative hypothesis: two-sided

Robustness here is judged by the statistical power of the test
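
To see where the statistic $D$ comes from, the sketch below recomputes it by hand as the largest gap between the empirical CDF of the sample and the hypothesised exponential CDF. The names `Fx` and `D` are illustrative:

```r
set.seed(1); X <- rexp(n=100, rate=2)   # same sample as in the slides

n <- length(X)
Fx <- pexp(sort(X), rate=2)   # hypothesised CDF at the ordered observations

# Largest gap between the empirical CDF and the hypothesised CDF,
# checked just after and just before each jump of the empirical CDF
D <- max(c(seq_len(n) / n - Fx, Fx - (seq_len(n) - 1) / n))
D

# Should agree with the statistic reported by ks.test()
unname(ks.test(X, pexp, rate=2)$statistic)
```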

Visual Examination

Okay, doing math is cool and all
But in a large sample, even a small deviation will result in $H_{0}$ rejection
Which means the previously mentioned tests become much less useful!
We can rely on some visual cues to determine the distribution though

For this demonstration, I will again use the previous object $X$

Hey, that's a good start! This does not clearly suggest a specific distribution though :(

Quantile-Quantile Plot (QQ Plot) can give a better visual cue :)
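
The figures themselves are not reproduced in this text. As a rough sketch using base R graphics (an assumption, since the original plots may have been drawn differently), a histogram and a QQ plot against the exponential distribution could be produced as follows. The name `theoretical` is illustrative:

```r
set.seed(1); X <- rexp(n=100, rate=2)   # same sample as in the slides

# Histogram with the hypothesised exponential density overlaid
hist(X, freq=FALSE, main="Histogram of X")
curve(dexp(x, rate=2), add=TRUE, lty=2)

# QQ plot: theoretical exponential quantiles against the ordered sample
theoretical <- qexp(ppoints(length(X)), rate=2)
qqplot(theoretical, X, xlab="Theoretical quantiles", ylab="Sample quantiles")
abline(0, 1, lty=2)   # points near this line suggest a good fit
```

Swapping `qexp()` for another quantile function gives a QQ plot against a different candidate distribution.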

Test of Normality

Practically a subset of goodness of fit tests
Some are more appropriate under certain circumstances
We shall go through the widely used ones
$H_{0}$: Sample follows the normal distribution
$H_{1}$: Sample does not follow the normal distribution

Shapiro-Wilk Test

A well-established test to assess normality
Can tolerate skewness to a certain degree
Implementation in R: sample size between 3 and 5000

Anderson-Darling Test

Less well-known compared to Shapiro-Wilk
Gives more weight to the tails
Implementation in R (nortest package): sample size must be greater than 7

Demonstration

Let $X \sim N (0, 1) : n = 100$

set.seed(1)
X <- rnorm(n=100, mean=0, sd=1)

Shapiro-Wilk

shapiro.test(X)

## 
##     Shapiro-Wilk normality test
## 
## data:  X
## W = 1, p-value = 1

Anderson-Darling

nortest::ad.test(X)

## 
##     Anderson-Darling normality test
## 
## data:  X
## A = 0.16, p-value = 0.9

$χ^{2}$ and Normal Distribution

Raise standard normal data to the power of two
The result shall follow a $χ^{2}$ distribution with 1 degree of freedom

For demonstration purposes, we will re-use $X \sim N (0, 1) : n = 100$

set.seed(1)
X <- rnorm(n=100, mean=0, sd=1)

We previously tested $X$ with the Shapiro-Wilk and Anderson-Darling tests, and neither rejected normality.

Now, we will raise it to the power of two

X2 <- X^2

Does it still follow a normal distribution?

shapiro.test(X2)

## 
##     Shapiro-Wilk normality test
## 
## data:  X2
## W = 0.7, p-value = 5e-13

It does not follow a normal distribution at all. Does it follow the $χ^{2}$ distribution though?

ks.test(X2, pchisq, df=1)

## 
##     One-sample Kolmogorov-Smirnov test
## 
## data:  X2
## D = 0.1, p-value = 0.2
## alternative hypothesis: two-sided

Central Limit Theorem