Data: Type and Distribution
Aly Lamuri
Indonesia Medical Education and Research Institute
Overview
Nominal
Other examples:
Ordinal
Other examples:
Discrete (clue: countable)
Continuous (clue: measurable)
Continuous:
Overview
P(E=e)=ES
H
indicates the
head and T
indicates the tailP(E=e)=ES
H
indicates the
head and T
indicates the tailset.seed(1)S <- sample(c("H", "T"), 10, replace=TRUE, prob=rep(1/2, 2)) %T>% print()
## [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"
P(E=e)=ES
P(E=e)=ES
E <- S[which(S == "H")] %T>% print()
## [1] "H" "H" "H" "H" "H" "H"
P(E=e)=ES
P(E=e)=ES
length(E) / length(S)
## [1] 0.6
P(E=e)=ES
length(E) / length(S)
## [1] 0.6
## [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"
4
while rolling a dice three times4
while rolling a dice three timesTree diagram is available to solve a more complex probability problem
B (30/80) / /80 \ \ R (50/80)
B (29/79) / B (30/80) / \ / R (50/79)80 \ \ R (50/80)
/ B (28/78) B (29/79) / \ R (50/78) B (30/80) / \ / R (50/79)80 \ \ R (50/80)
/ B (28/78) B (29/79) / \ R (50/78) B (30/80) / \ / R (50/79)80 \ \ R (50/80)
The chance for having three blue balls is 0.0494
/ B (28/78) B (29/79) / \ R (50/78) B (30/80) / \ / B (?) / R (50/79)80 \ R (?) \ \ R (50/80)
We have learnt how to draw a tree diagram. Now, what should we fill the question mark with?
/ B (28/78) B (29/79) / \ R (50/78) B (30/80) / \ / B (29/78) / R (50/79)80 \ R (49/78) \ \ R (50/80)
n
, indicating the number of roll dice <- function(n) { sample(1:6, n, replace=TRUE, prob=rep(1/6, 6))}
n
, indicating the number of roll dice <- function(n) { sample(1:6, n, replace=TRUE, prob=rep(1/6, 6))}
Let's see whether our function work...
n
, indicating the number of roll dice <- function(n) { sample(1:6, n, replace=TRUE, prob=rep(1/6, 6))}
Let's see whether our function work...
dice(1)
## [1] 3
It does!
set.seed(1)roll <- dice(10) %T>% print()
## [1] 3 4 5 1 3 1 1 5 5 2
set.seed(1)roll <- dice(10) %T>% print()
## [1] 3 4 5 1 3 1 1 5 5 2
set.seed(1)roll <- dice(10) %T>% print()
## [1] 3 4 5 1 3 1 1 5 5 2
set.seed(1)roll <- dice(10) %T>% print()
## [1] 3 4 5 1 3 1 1 5 5 2
set.seed(1)roll <- dice(10) %T>% print()
## [1] 3 4 5 1 3 1 1 5 5 2
set.seed(1); roll <- dice(100)sum(roll==4) / length(roll)
## [1] 0.25
set.seed(1); roll <- dice(100)sum(roll==4) / length(roll)
## [1] 0.25
set.seed(1); roll <- dice(1000)sum(roll==4) / length(roll)
## [1] 0.2
set.seed(1); roll <- dice(100)sum(roll==4) / length(roll)
## [1] 0.25
set.seed(1); roll <- dice(1000)sum(roll==4) / length(roll)
## [1] 0.2
set.seed(1); roll <- dice(10000)sum(roll==4) / length(roll)
## [1] 0.1724
set.seed(1); roll <- dice(100000)sum(roll==4) / length(roll)
## [1] 0.1661
set.seed(1); roll <- dice(100000)sum(roll==4) / length(roll)
## [1] 0.1661
set.seed(1); roll <- dice(1000000)sum(roll==4) / length(roll)
## [1] 0.1664
set.seed(1); roll <- dice(100000)sum(roll==4) / length(roll)
## [1] 0.1661
set.seed(1); roll <- dice(1000000)sum(roll==4) / length(roll)
## [1] 0.1664
set.seed(1); roll <- dice(10000000)sum(roll==4) / length(roll)
## [1] 0.1666
Or mathematically:
ϵ=√^p(1−^p)N, where:
ϵ: Error
^p: Estimated probability (current trial)
N: Number of resampling
How high is the error in our trials?
How high is the error in our trials?
epsilon <- function(p.hat, n) { sqrt({p.hat * (1-p.hat)}/n)}
How high is the error in our trials?
epsilon <- function(p.hat, n) { sqrt({p.hat * (1-p.hat)}/n)}
roll <- c(10, 100, 1000, 10000, 100000, 1000000, 10000000)prob <- sapply(roll, function(n) { set.seed(1); roll <- dice(n) sum(roll==4) / length(roll)})
df <- data.frame(list("roll"=roll, "prob"=prob))df %>% knitr::kable() %>% kable_styling()
roll | prob |
---|---|
1e+01 | 0.1000 |
1e+02 | 0.2500 |
1e+03 | 0.2000 |
1e+04 | 0.1724 |
1e+05 | 0.1661 |
1e+06 | 0.1664 |
1e+07 | 0.1666 |
df <- data.frame(list("roll"=roll, "prob"=prob))df %>% knitr::kable() %>% kable_styling()
roll | prob |
---|---|
1e+01 | 0.1000 |
1e+02 | 0.2500 |
1e+03 | 0.2000 |
1e+04 | 0.1724 |
1e+05 | 0.1661 |
1e+06 | 0.1664 |
1e+07 | 0.1666 |
df$error <- mapply(function(p.hat, n) { epsilon(p.hat, n)}, p.hat=df$prob, n=df$roll) %T>% print()
## [1] 0.0948683 0.0433013 0.0126491 0.0037773 0.0011769 0.0003724 0.0001178
Here is a nice figure to summarize the concept:
And another figure to see the error:
Task description:
1
as the seed for each resamplingIndependent vs Identical? → I.I.D
Independent vs Identical? → I.I.D
Independent vs Identical? → I.I.D
Considering I.I.D, can we do a better probability estimation?
Independent vs Identical? → I.I.D
Considering I.I.D, can we do a better probability estimation?
Independent vs Identical? → I.I.D
Considering I.I.D, can we do a better probability estimation?
In math, please?
P(E=e)=f(e)>0:E∈S∑e∈Sf(e)=1P(E∈A)=∑e∈Af(e):A⊂S(1)(2)(3)
f(x)=(nx)px(1−p)n−x(nx)=n!x!(n−x)!(1)
Or simply denoted as: X∼B(n,p)
f(x)=(nx)px(1−p)n−x(nx)=n!x!(n−x)!(1)
Or simply denoted as: X∼B(n,p)
μ=n⋅pσ=√μ⋅(1−p)
f(n)=P(X=n)=p(1−p)n−1,with:
n: Number of trials to get an event
p: The probability of getting an event
Or simply denoted as X∼G(p)
f(n)=P(X=n)=p(1−p)n−1,with:
n: Number of trials to get an event
p: The probability of getting an event
Or simply denoted as X∼G(p)
μ=1pσ=√1−pp2
f(x)=e−λλxx!, with:
x: The number of expected events
e: Euler's number
λ: Average number of events in one time frame
Or simply denoted as X∼P(λ)
μ=λσ=√λ
f(x)=1b−a
Or simply denoted as X∼U(a,b)
f(x)=1b−a
Or simply denoted as X∼U(a,b)
μ=b+a2σ=(b−a)212
f(x)=λe−xλ,with:
x: Time needed to observe an event
λ: The rate for a certain event
Or simply denoted as X∼Exponential(λ)
μ=σ=1λ
f(x)=βαΓ(α)xα−1e−xβΓ(α)=∫∞0yα−1e−y dy, with:
β: Rate ( λ in exponential PDF)
α: Shape
Γ: Gamma function
e: Euler number
Or simply denoted as X∼Γ(α,β)
f(x)=βαΓ(α)xα−1e−xβΓ(α)=∫∞0yα−1e−y dy, with:
β: Rate ( λ in exponential PDF)
α: Shape
Γ: Gamma function
e: Euler number
Or simply denoted as X∼Γ(α,β)
If we were to assign the shape parameter α=1, we get an exponential PDF.
f(x)=βαΓ(α)xα−1e−xβΓ(α)=∫∞0yα−1e−y dy, with:
β: Rate ( λ in exponential PDF)
α: Shape
Γ: Gamma function
e: Euler number
Or simply denoted as X∼Γ(α,β)
If we were to assign the shape parameter α=1, we get an exponential PDF. Therefore, Exponential(λ)∼Γ(1,λ).
f(x)=βαΓ(α)xα−1e−xβΓ(α)=∫∞0yα−1e−y dy, with:
β: Rate ( λ in exponential PDF)
α: Shape
Γ: Gamma function
e: Euler number
Or simply denoted as X∼Γ(α,β)
If we were to assign the shape parameter α=1, we get an exponential PDF. Therefore, Exponential(λ)∼Γ(1,λ).
μ=αβσ=√αβ
f(x)=1Γ(k/2)2k/2xk/2−1e−x/2, with:
k: Degree of freedom
The rest are Gamma PDF derivations
Or simply denoted as X∼χ2(k)
μ=kσ=√2k
f(x)=1Γ(k/2)2k/2xk/2−1e−x/2, with:
k: Degree of freedom
The rest are Gamma PDF derivations
Or simply denoted as X∼χ2(k)
μ=kσ=√2k
Relation to normal distribution?
f(x)=1σ√2πexp{−12(x−μσ)2}, with:
x∈R:−∞<x<∞
μ∈R:−∞<μ<∞
σ∈R:0<σ<∞
Or simply denoted as X∼N(μ,σ)
Overview
Pr(X=k)=(nk)pk(1−p)n−k
Remember we previously tossed a coin 10 times?
Remember we previously tossed a coin 10 times?
set.seed(1)S <- sample(c("H", "T"), 10, replace=TRUE, prob=rep(1/2, 2)) %T>% print()
## [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"
length(E) / length(S)
## [1] 0.6
Remember we previously tossed a coin 10 times?
set.seed(1)S <- sample(c("H", "T"), 10, replace=TRUE, prob=rep(1/2, 2)) %T>% print()
## [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"
length(E) / length(S)
## [1] 0.6
If it represents a Bernoulli trial, it should satisfy P(X=6) in such a way that we cannot reject the H0 when calculating its probability:
P(X=6)=(106)0.56(1−0.5)4
Luckily, we do not need to compute it by hand
Luckily, we do not need to compute it by hand (yet)
binom.test(x=6, n=10, p=0.5)
## ## Exact binomial test## ## data: 6 and 10## number of successes = 6, number of trials = 10, p-value = 0.8## alternative hypothesis: true probability of success is not equal to 0.5## 95 percent confidence interval:## 0.2624 0.8784## sample estimates:## probability of success ## 0.6
Luckily, we do not need to compute it by hand (yet)
binom.test(x=6, n=10, p=0.5)
## ## Exact binomial test## ## data: 6 and 10## number of successes = 6, number of trials = 10, p-value = 0.8## alternative hypothesis: true probability of success is not equal to 0.5## 95 percent confidence interval:## 0.2624 0.8784## sample estimates:## probability of success ## 0.6
Interpreting the p-value, we cannot reject the H0, so our coin toss followed the Bernoulli trial after all.
Robustness based on yielded power
Let X∼Exponential(2):n=100
set.seed(1); X <- rexp(n=100, rate=2)
Robustness based on yielded power
Let X∼Exponential(2):n=100
set.seed(1); X <- rexp(n=100, rate=2)
By imputing λ variable, Kolmogorov-Smirnov can compute its goodness of fit
ks.result <- ks.test(X, pexp, rate=2)
Robustness based on yielded power
print(ks.result)
## ## One-sample Kolmogorov-Smirnov test## ## data: X## D = 0.084, p-value = 0.5## alternative hypothesis: two-sided
Robustness based on yielded power
For this demonstration, I will again use the previous object X
Hey, that's a good start!
Hey, that's a good start! This does not clearly suggest a specific distribution though :(
Quantile-Quantile Plot (QQ Plot) can give a better visual cue :)
Overview
R
: sample size between 3 and 5000R
: minimum sample size is 7Let X∼N(0,1):n=100
set.seed(1)X <- rnorm(n=100, mean=0, sd=1)
Let X∼N(0,1):n=100
set.seed(1)X <- rnorm(n=100, mean=0, sd=1)
shapiro.test(X)
## ## Shapiro-Wilk normality test## ## data: X## W = 1, p-value = 1
Let X∼N(0,1):n=100
set.seed(1)X <- rnorm(n=100, mean=0, sd=1)
nortest::ad.test(X)
## ## Anderson-Darling normality test## ## data: X## A = 0.16, p-value = 0.9
For demonstration purposes, we will re-use X∼N(0,1):n=100
set.seed(1)X <- rnorm(n=100, mean=0, sd=1)
We previously tested X against Shapiro-Wilk and Anderson-Darling tests to indicate normality.
For demonstration purposes, we will re-use X∼N(0,1):n=100
set.seed(1)X <- rnorm(n=100, mean=0, sd=1)
We previously tested X against Shapiro-Wilk and Anderson-Darling tests to indicate normality.
Now, we will raise it to the power of two
X2 <- X^2
Does it still follow a normal distribution?
shapiro.test(X2)
## ## Shapiro-Wilk normality test## ## data: X2## W = 0.7, p-value = 5e-13
It does not follow normal distribution at all.
It does not follow normal distribution at all. Does it follow the χ2 distribution though?
It does not follow normal distribution at all. Does it follow the χ2 distribution though?
ks.test(X2, pchisq, df=1)
## ## One-sample Kolmogorov-Smirnov test## ## data: X2## D = 0.1, p-value = 0.2## alternative hypothesis: two-sided
Overview
¯Xd→N(μ,σ√n)as n→∞
d→ is a convergence of random variables
¯Xd→N(μ,σ√n)as n→∞
d→ is a convergence of random variables
¯Xd→N(μ,σ√n)as n→∞
¯Xd→N(μ,σ√n)as n→∞
How do we determine n?
¯Xd→N(μ,σ√n)as n→∞
How do we determine n?→ Simulation
¯Xd→N(μ,σ√n)as n→∞
How do we determine n?→ Simulation
¯Xd→N(μ,σ√n)as n→∞
How do we determine n?→ Simulation
¯Xd→N(μ,σ√n)as n→∞
How do we determine n?→ Simulation
¯Xd→N(μ,σ√n)as n→∞
How do we determine n?→ Simulation
¯Xd→N(μ,σ√n)as n→∞
How do we determine n?→ Simulation
¯Xd→N(μ,σ√n)as n→∞
How do we determine n?→ Simulation
¯Xd→N(μ,σ√n)as n→∞
How do we determine n?→ Simulation
¯Xd→N(μ,σ√n)as n→∞
¯Xd→N(μ,σ√n)as n→∞
Overview
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |