
To comprehend our surroundings, we rely on processed sensory inputs and infer the best action to respond with. In a more complex (or complicated) situation, we often find recorded inputs more reliable than our perception alone. Recorded inputs are data, which may or may not follow a certain structure. In essence, data are distinguishable based on their properties, resulting in different data types. As more data are collected, patterns emerge, and we can observe how one pattern behaves differently from another. Here we discuss the nature of data, viz. its type and distribution.

Type of Data

Numerous conventions exist for describing data. However, we shall keep in mind that understanding the nature of categorical and numeric data is more important. We may further divide categorical data into nominal and ordinal, and numeric data into discrete and continuous. Within continuous data, it is also of interest to distinguish interval from ratio data.

Categorical - Nominal

Nominal data are observable through their distinctive characteristics. Differences seen in nominal data are not directly comparable; they exist only to help identify one object from another. Examples of nominal data include gender, the brand of a certain product, animal species, type of tea, and so on. It is important to note that nominal data do not only come in two classes, but may also have multiple classes.

Categorical - Ordinal

The defining attribute of ordinal data is the stratification existing among its members, which enables a direct comparison. There are numerous examples of ordinal data; spiciness level, educational background, malignancy and severity are to name but a few. As we can imagine, we can directly compare spicy foods to less spicy ones. We cannot confidently say the same about nominal data, as we cannot determine which brand is more than another. For instance, we cannot say Anchor is more butter compared to Wijsman; that does not make any sense. But we can say bird's eye chilli peppers are spicier compared to cayenne peppers.

Numeric - Discrete

Discrete data are countable, which makes them amenable to a subset of arithmetical operations, including addition and subtraction. Any frequency measure is a type of discrete numeric data. Previously, we saw the difference between ordinal and nominal data; however, any class from ordinal or nominal data can be subjected to a frequency measure. It means that we can count the number of female students in a university, or how many bird's eye chilli peppers we have in a box. In such a situation, the female students and the particular type of chilli pepper still belong to categorical data, but the counts themselves are discrete numeric data. So, in essence, discrete data are the result of frequency counting; in other words, being countable defines discrete numeric data. Other examples of discrete data include the number of apples in a store, jugs arranged on a shelf, plastic bottle wastes on the shore, and so on.

Numeric - Continuous

If being countable belongs to discrete data, then being measurable is the main characteristic of continuous numeric data. Continuous data assume that their subjects can be assessed with a certain measurement instrument. In measuring an object, we may stumble upon the question of whether we are looking at an interval or the actual ratio. What is the difference?

Interval

Interval data have no absolute zero, but they present a fixed distance between values. Some arithmetical operations are applicable, including addition and subtraction. Examples of this data type include temperature on scales other than Kelvin. To make sense of interval data, imagine it this way: you are measuring water temperature using a thermometer. Suppose you have two cups of water, one filled with hot and another with cold water. You measure both cups, where the hot one is about 40°C and the cold one is 20°C. We can say the hot cup is warmer, but we cannot state that 40°C is twice as hot as 20°C, because Celsius is a relative measure. We shall see how Kelvin differs in that respect.

Ratio

In ratio data, we do not assume a fixed distance, as the measure may take infinitesimal values. However, ratio data have an absolute zero, meaning that any measurement \(x\) is a non-negative real number, with zero being the lower bound, or simply \(x \in \mathbb{R}: x \geqslant 0\). Examples of ratio data include weight, volume and temperature in Kelvin. Kelvin is an absolute scale anchored at absolute zero, so it reflects the actual amount of thermal energy held, dissipated or taken up by a particular object. So in case we have two cups with different temperatures measured in Kelvin, we can argue that one is a few times hotter than the other.

A Glimpse Into Probability

Being able to collect data means we can process them to produce an inference about our surroundings. I may overuse (or even tend to abuse) the word “inference” in this post, but please bear with it. We want to make sense of our world using the data we have. For example, suppose we are ordering food which comes with cayenne chilli peppers. We know from previous experience that out of ten times consuming cayenne peppers, three resulted in diarrhoea. Now, we would like to know the possibility of having diarrhoea if we proceed to eat this chilli pepper. In essence, we are facing a probability problem, i.e. what is the chance of having diarrhoea after consuming a cayenne chilli pepper?

In a less sophisticated manner, we may regard probability as a relative frequency. It means that we are estimating a proportion parameter using a statistical approach. If out of ten times we got three occurrences, we are inclined to say that our chance is roughly 30%. In measuring probability, we need to know the event and its sample space. An event \(E\) is an expected result, whereas the sample space \(S\) is the set of all possible outcomes. Therefore, we can calculate the probability of a particular event as follows:

$$P(E=e) = \frac{|E|}{|S|}$$

Now suppose we have a fair coin and we intend to flip it 10 times, where H indicates the head and T the tail. In such a case, our sample space is:

# the %T>% ("tee") pipe used throughout comes from the magrittr package
library(magrittr)
set.seed(1)
S <- sample(c("H", "T"), 10, replace=TRUE, prob=rep(1/2, 2)) %T>% print()
|  [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"

Let heads be our expected outcome, so we can list our event as:

E <- S[which(S == "H")] %T>% print()
| [1] "H" "H" "H" "H" "H" "H"

Thus, we can regard the probability of having a desired outcome as a relative frequency of events in a given sample space:

length(E) / length(S)
| [1] 0.6

So, ten flips of a fair coin resulted in a 60% chance of having heads. The method we have used so far to determine the probability is enumeration, i.e. listing all desired events and dividing by the size of the sample space. However, a larger sample space makes enumeration harder to carry out, and this is even more apparent with a sequential problem. In a sequential problem, we have to calculate the probability over multiple instances, e.g. having two heads consecutively, as sketched below.
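A minimal sketch of such a sequential problem (the objects S2 and E2 are my own illustrative helpers, not from the original example): enumerate the sample space of two consecutive flips and count the outcomes consisting of two heads.

S2 <- expand.grid(first=c("H", "T"), second=c("H", "T"))  # all four ordered outcomes
E2 <- S2[S2$first == "H" & S2$second == "H", ]            # the event "two heads"
nrow(E2) / nrow(S2)                                       # 1/4 = 0.25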

Other methods for estimating the probability are the tree diagram and the resampling method. The tree diagram helps solve more complex probability problems. Suppose we have an urn filled with balls which are identical except for their colours. Let the colours be red and blue, where our urn contains 50 red and 30 blue balls. If we take three balls without replacement, how high is the chance of getting three blue balls? This is the urn problem, a classic from statistics and probability theory. If we were to build a tree diagram, it would look as follows (B stands for blue and R for red):

                        / B (28/78)
               B (29/79)
             /          \ R (50/78)
    B (30/80)
   /         \
  /            R (50/79)
80
  \
   \          
    R (50/80)

Please note that this is an incomplete tree diagram. Of course, you may fill in the rest, but we are only interested in the probability of getting three blue balls. If we take only the upper branches, representing three consecutive draws of a blue ball, we get a probability of 0.0494. For this type of problem, a tree diagram is practically easier than listing all possible outcomes and desired events.
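As a quick check, the upper branch is simply the product of the conditional probabilities written on the tree:

prod(30/80, 29/79, 28/78)  # approximately 0.0494, as stated above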

To understand the resampling method, we will conduct a quick experiment, where we roll a die numerous times. Since we do not physically own a die (and even if we did, I would not suggest doing this experiment physically), we will make a simple function to do the dice roll for us. I will call this function dice. Very creative, huh? ;)

dice <- function(n) {
    # n rolls of a fair six-sided die
    sample(1:6, n, replace=TRUE, prob=rep(1/6, 6))
}

We want to know whether our function works:

dice(1)
| [1] 3

It does! Now, we can safely proceed and roll the dice ten times. Let 4 be our outcome of interest. How high is the probability of having the event within 10 trials?

set.seed(1)
roll <- dice(10) %T>% print()
|  [1] 3 4 5 1 3 1 1 5 5 2

It turns out the probability of getting a 4 is 1/10. Since we have a fair die, why is the probability not 1/6? The clue lies in the distinction between a sample and the population: we only have a subset of the population. Ideally, we would need to roll the die \(n\) times, with \(n \to \infty\). The more samples we have, the closer they come to representing the population. What will we get with different numbers of rolls?

roll <- c(10, 100, 1000, 10000, 100000, 1000000, 10000000)
prob <- sapply(roll, function(n) {
    set.seed(1); roll <- dice(n)
    sum(roll==4) / length(roll)
})
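Although the estimates are not printed here, a quick tabulation against the theoretical value of 1/6 shows how they converge (a sketch; output omitted):

round(cbind(n=roll, estimate=prob, theoretical=1/6), 4)  # estimates approach 1/6 as n grows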

Theoretically, the error of the estimated probability is inversely proportional to the square root of the number of trials, which we can compute as:

$$\epsilon = \sqrt{\frac{\hat{p} (1-\hat{p})}{N}},\ where:$$

\(\epsilon\): Error
\(\hat{p}\): Estimated probability (current trial)
\(N\): Number of resampling

To compute the error in our experiment, we will make a function as follows:

epsilon <- function(p.hat, n) {
    sqrt((p.hat * (1 - p.hat)) / n)
}
df <- data.frame(roll=roll, prob=prob)
df$error <- mapply(function(p.hat, n) {
    epsilon(p.hat, n)
}, p.hat=df$prob, n=df$roll) %T>% print()
| [1] 0.0948683 0.0433013 0.0126491 0.0037773 0.0011769 0.0003724 0.0001178

With more trials, we get closer to the expected probability for a fair die, which is 1/6, or approximately 0.1667.

Following the theory above, the error shrinks as the number of samples grows, in proportion to \(1/\sqrt{N}\).

Random Variables

We have often heard, or at least will often hear, about independent and identically distributed (I.I.D) random variables. I.I.D simply states that all sampled random variables should be independent from one another, and that each sampling instance should follow an identical procedure. With I.I.D variables, we can better approximate probabilities using:

  • Probability Mass Function (PMF) in discrete variable
  • Probability Density Function (PDF) in continuous variable

The choice of probability function is, in principle, arbitrary; we will take a deeper look at the most commonly used ones. Mathematically, we can have any probability function as long as it satisfies the following rules:

\begin{align} P(E=e) &= f(e) > 0,\ E \in S \tag{1} \\ \displaystyle \sum_{e \in S} f(e) &= 1 \tag{2} \\ P(E \in A) &= \displaystyle \sum_{e \in A} f(e),\ A \subset S \tag{3} \end{align}
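As a quick concrete check of rule (2), the probabilities of a Binomial(10, 0.5) PMF (introduced below) sum to one over the whole sample space:

sum(dbinom(0:10, size=10, prob=0.5))  # equals 1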

Discrete Distribution

In the previous section explaining probability, we flipped coins to see how probability works. Coin flipping is a situation where we expect one of two possible outcomes, of which we desire a certain event to happen. In essence, we have conducted a series of Bernoulli trials, which require:

  • An experiment with only two outcomes
  • All trials being independent from each other
  • All instances have an identical probability

We shall see some probability functions describing a series of Bernoulli trials.

Binomial Distribution

The binomial is a type of discrete distribution arising from \(n\) identical trials. Each trial is an independent instance corresponding to a Bernoulli trial. The binomial distribution uses the following PMF, which gives the probability of observing the desired outcome (an event) a certain number of times within \(n\) trials.

\begin{align} f(x) &= \binom{n}{x} p^x (1-p)^{n-x} \tag{1} \\ \binom{n}{x} &= \frac{n!}{x! (n-x)!} \end{align}

Or simply denoted as: \(X \sim B(n, p)\)

Knowing the PMF and its parameters, we can calculate the descriptive statistics. From this point onward, this post will only consider explaining mean and standard deviation. We will find their use in comprehending the Central Limit Theorem (CLT).

\begin{align} \mu &= n \cdot p \\ \sigma &= \sqrt{\mu \cdot (1-p)} \end{align}

By changing the parameters, we can see how the binomial distribution behaves.
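A minimal sketch of how such figures might be produced in R (the parameter values here are my own illustrative choices):

x <- 0:20
plot(x, dbinom(x, size=20, prob=0.5), type="h", lwd=2,
     xlab="x", ylab="P(X = x)", main="Binomial PMF, n = 20")         # p = 0.5
points(x, dbinom(x, size=20, prob=0.2), type="h", lwd=2, col="red")  # p = 0.2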

Geometric Distribution

Suppose we are conducting Bernoulli trials and we are interested in the number of trials needed before getting an event. In other words, we are looking for a way to calculate the probability of observing the event only after a number of failures. The geometric distribution derives from the Bernoulli setting, where the first \(n-1\) trials fail and the \(n\)-th trial succeeds. So we can formulate the geometric distribution as:

$$f(n) = P(X=n) = p (1-p)^{n-1}, with:$$

\(n\): Number of trials to get an event
\(p\): The probability of getting an event
Or simply denoted as \(X \sim G(p)\)

\begin{align} \mu &= \frac{1}{p} \\ \sigma &= \sqrt{\frac{1-p}{p^2}} \end{align}

Using different values of \(p\) as its parameter, we obtain different probability distributions.
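A minimal sketch of such a figure. Note that R's dgeom() counts the number of failures before the first success, so the text's \(f(n)\) corresponds to dgeom(n - 1, p); the \(p\) values here are illustrative.

n <- 1:20
plot(n, dgeom(n - 1, prob=0.5), type="h", lwd=2,
     xlab="n (trials until first success)", ylab="P(X = n)", main="Geometric PMF")
points(n, dgeom(n - 1, prob=0.2), type="h", lwd=2, col="red")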

Poisson Distribution

Suppose we know the rate at which a certain outcome occurs; the Poisson distribution then defines the probability of that outcome happening \(x\) times. The count in a Poisson distribution is discrete, as we limit our observation to a particular time frame (observation period). The Poisson distribution uses the following PMF:

$$f(x) = \frac{e^{-\lambda}\lambda^x}{x!},\ with:$$

\(x\): The number of expected events
\(e\): Euler’s number
\(\lambda\): Average number of events in one time frame

Or simply denoted as \(X \sim P(\lambda)\)

\begin{align} \mu &= \lambda \\ \sigma &= \sqrt{\lambda} \end{align}

Altering the \(\lambda\) parameter, we can see how the Poisson distribution behaves.
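A minimal sketch of such a figure (the \(\lambda\) values are illustrative):

x <- 0:20
plot(x, dpois(x, lambda=2), type="h", lwd=2,
     xlab="x", ylab="P(X = x)", main="Poisson PMF")         # lambda = 2
points(x, dpois(x, lambda=8), type="h", lwd=2, col="red")   # lambda = 8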

Continuous Distribution

So far, we have seen three commonly used distributions for explaining discrete variables, which often arise from Bernoulli trials. Understanding the essence behind these probability distributions tempts us to find a plausible explanation for continuous distributions as well. However, since we now consider continuity, it makes no sense to use a PMF, because a PMF defines the probability of a given discrete value. We will instead look into the probability density function (PDF) and use histograms with density plots to depict probabilities on a continuous measure.

Uniform Distribution

As its name suggests, the uniform distribution is a type of continuous distribution which describes (drum roll intensifies) uniform probabilities. We may not find many uses for it in common statistical procedures, but the uniform distribution is a capable tool for generating random numbers. This is useful when you need a properly randomized situation, for example when you randomly assign patients to groups in a clinical trial (a sketch follows the formulas below).

$$f(x) = \frac{1}{b-a}$$

Or simply denoted as \(X \sim U(a,b)\)

\begin{align} \mu &= \frac{b+a}{2} \\ \sigma &= \frac{b-a}{\sqrt{12}} \end{align}
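A hedged sketch of the randomization use mentioned above (the group labels, threshold and sample size are my own illustrative choices): assign ten hypothetical patients to two arms using uniform random numbers.

set.seed(1)
u <- runif(10, min=0, max=1)                      # draws from U(0, 1)
group <- ifelse(u < 0.5, "Treatment", "Control")  # split at the midpoint
table(group)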

Exponential Distribution

Previously, we used the Poisson distribution to find the probability of observing a given number of events at a particular occurrence rate. If we are instead interested in the minimum time required to observe an event, we can re-parameterize the Poisson process. The exponential PDF calculates the probability of spending a certain amount of time before observing an event in a Poisson process. Please be advised, though, that it is not always about time: the "time frame" can be any continuous measure of exposure in a Poisson process, such as mileage, weight, volume, etc.

$$f(x) = \lambda e^{-x \lambda}, with:$$

\(x\): Time needed to observe an event
\(\lambda\): The rate for a certain event

Or simply denoted as \(X \sim Exponential(\lambda)\)

$$\mu = \sigma = \frac{1}{\lambda}$$

Changing the \(\lambda\) parameter changes how quickly the density decays.
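A minimal sketch of such a figure (the \(\lambda\) values are illustrative):

x <- seq(0, 5, by=0.01)
plot(x, dexp(x, rate=0.5), type="l", ylim=c(0, 2),
     xlab="x", ylab="f(x)", main="Exponential PDF")  # lambda = 0.5
lines(x, dexp(x, rate=1), col="red")                 # lambda = 1
lines(x, dexp(x, rate=2), col="blue")                # lambda = 2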

Gamma Distribution

The exponential distribution is a good approximation for the probability of spending a particular amount of time waiting for one event to occur. However, when we are interested in the probability over a number of events, the Gamma PDF models a more suitable distribution. The Gamma PDF only differs from the exponential PDF with respect to its shape parameter: when we set the shape \(\alpha=1\) in the Gamma PDF, we get the exponential PDF.

\begin{align} f(x) &= \frac{\beta^\alpha}{\Gamma(\alpha)}x^{\alpha-1}e^{-x \beta} \\ \Gamma(\alpha) &= \displaystyle \int_0^\infty y^{\alpha -1} e^{-y}\ dy,\ with: \end{align}

\(\beta\): Rate ( \(\lambda\) in exponential PDF)
\(\alpha\): Shape
\(\Gamma\): Gamma function
\(e\): Euler number

Or simply denoted as \(X \sim \Gamma(\alpha, \beta)\)

\begin{align} \mu &= \frac{\alpha}{\beta} \\ \sigma &= \frac{\sqrt{\alpha}}{\beta} \end{align}

For the exponential distribution, we observed how the distribution changes over different \(\lambda\) values. Since the rate parameter \(\beta\) has a similar effect in the Gamma PDF, we will instead see how the distribution looks over different values of the shape parameter \(\alpha\).
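A minimal sketch of such a figure (the rate is fixed at 1 and the shape values are illustrative; \(\alpha = 1\) reproduces the exponential PDF):

x <- seq(0, 10, by=0.01)
plot(x, dgamma(x, shape=1, rate=1), type="l", ylim=c(0, 1),
     xlab="x", ylab="f(x)", main="Gamma PDF, rate = 1")  # alpha = 1
lines(x, dgamma(x, shape=2, rate=1), col="red")          # alpha = 2
lines(x, dgamma(x, shape=5, rate=1), col="blue")         # alpha = 5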

\(\chi^2\) Distributions

As special cases of the Gamma distribution, the \(\chi^2\) PDF defines a family of distributions widely used in statistical inference. This post will not describe the \(\chi^2\) distributions at full length. Instead, we shall take a look at the general form and later see its applications in upcoming lectures.

$$f(x) = \frac{1}{\Gamma (k/2) 2^{k/2}} x^{k/2 - 1} e^{-x/2},\ with:$$

\(k\): Degrees of freedom
The remaining terms follow from the Gamma PDF

Or simply denoted as \(X \sim \chi^2(k)\)

\begin{align} \mu &= k \\ \sigma &= \sqrt{2k} \end{align}

Normal Distribution

Normal PDF is arguably the most ubiquitous probability function in modelling real-world data. It is a symmetric distribution where \(\mu\) and \(\sigma\) completely describe the probability.

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}exp \bigg\{ -\frac12 \bigg( \frac{x-\mu}{\sigma} \bigg)^2 \bigg\},\ with:$$

\(x \in \mathbb{R}: -\infty < x < \infty\)
\(\mu \in \mathbb{R}: -\infty < \mu < \infty\)
\(\sigma \in \mathbb{R}: 0 < \sigma < \infty\)

Or simply denoted as \(X \sim N(\mu, \sigma)\)

Altering the parameter \(\mu\) shifts the distribution along the x-axis.

Changing the parameter \(\sigma\) widens or narrows its spread.
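A minimal sketch of both effects in one figure (the parameter values are illustrative):

x <- seq(-6, 6, by=0.01)
plot(x, dnorm(x, mean=0, sd=1), type="l",
     xlab="x", ylab="f(x)", main="Normal PDF")   # mu = 0, sigma = 1
lines(x, dnorm(x, mean=2, sd=1), col="red")      # shifted mean
lines(x, dnorm(x, mean=0, sd=2), col="blue")     # larger sigma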

Goodness of Fit Test

To determine whether our data follow a particular distribution, we ought to conduct a statistical analysis called a goodness of fit test. Numerous methods exist, but we will dig into the more popular ones. Given the correct parameters, these tests compare the data against a fully specified distribution. Since we are conducting a statistical analysis, it is always good practice to clearly state the hypotheses:

  • \(H_0\): Given data follow a certain distribution
  • \(H_1\): Given data does not follow a certain distribution

Binomial Test

As a goodness of fit measure, the binomial test is an application of the binomial PMF. It determines whether an observed relative frequency is consistent with a Bernoulli trial's probability. First we need to compute the probability using the given parameters:

$$Pr(X=k) = \binom{n}{k}p^k (1-p)^{n-k}$$

Remember we previously tossed a coin 10 times? We will use that example to perform a binomial test.

set.seed(1)
S <- sample(c("H", "T"), 10, replace=TRUE, prob=rep(1/2, 2)) %T>% print()
|  [1] "T" "T" "H" "H" "T" "H" "H" "H" "H" "T"
length(E) / length(S)
| [1] 0.6

If it corresponds to Bernoulli trials with \(p = 0.5\), the observed six heads should satisfy \(P(X=6)\) in such a way that we cannot reject \(H_0\) when calculating its probability:

$$P(X=6) = \binom{10}{6}0.5^6(1-0.5)^4$$
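For reference, this single term is easy to evaluate directly (a quick sketch):

choose(10, 6) * 0.5^6 * (1 - 0.5)^4  # about 0.205; dbinom(6, 10, 0.5) gives the same value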

Luckily, we do not need to compute the full test by hand (yet). We can use the following function in R.

binom.test(x=6, n=10, p=0.5)
| 
|   Exact binomial test
| 
| data:  6 and 10
| number of successes = 6, number of trials = 10, p-value = 0.8
| alternative hypothesis: true probability of success is not equal to 0.5
| 95 percent confidence interval:
|  0.2624 0.8784
| sample estimates:
| probability of success 
|                    0.6

Interpreting the p-value, we cannot reject \(H_0\), so our coin tosses are consistent with a fair Bernoulli trial (\(p = 0.5\)) after all.

Kolmogorov-Smirnov Test

Often referred to as the KS test, it can assess a wide range of distributions using a non-parametric method. In terms of power, it is reasonably robust and is often compared to the Anderson-Darling and Shapiro-Wilk tests for the normal distribution. In the following example, we will demonstrate how to conduct a KS test against an exponential distribution.

Let \(X \sim Exponential(2): n = 100\)

set.seed(1)
X <- rexp(n=100, rate=2)

By supplying the rate parameter \(\lambda\), the Kolmogorov-Smirnov test can compute the goodness of fit:

ks.test(X, pexp, rate=2)
| 
|   Asymptotic one-sample Kolmogorov-Smirnov test
| 
| data:  X
| D = 0.084, p-value = 0.5
| alternative hypothesis: two-sided

With a p-value > 0.05, we fail to reject \(H_0\): our data are consistent with the hypothesized distribution, in this case \(X \sim Exponential(2)\).

Visual Examination

Conducting a numerical analysis gives us various advantages when reporting our results. However, with a large sample, even a small deviation will lead to rejection of \(H_0\), which means the previously mentioned tests become much less useful. Still, we can rely on some visual cues to determine the distribution. For this demonstration we will again use the previous object \(X\).

A histogram with a density overlay is a good start! Sadly, it does not always give a clear indication of which distribution the data follow.

Quantile-Quantile Plot (QQ Plot) can give a better visual cue :)
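A sketch of both visual checks, assuming the object X generated for the KS test above (\(X \sim Exponential(2),\ n = 100\)):

hist(X, freq=FALSE, main="Histogram of X with density overlay")
lines(density(X), col="red")
qqplot(qexp(ppoints(length(X)), rate=2), X,      # QQ plot against Exponential(2) quantiles
       xlab="Theoretical quantiles", ylab="Sample quantiles")
abline(0, 1, col="red")                          # points near this line suggest a good fit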

Test of Normality

A normality test is practically a subset of the goodness of fit tests, as it only concerns the normal distribution. Some tests are more appropriate under certain circumstances. The following sections will look at widely used normality tests in statistics. As always, we will begin by stating the hypotheses:

  • \(H_0\): Sample follows the normal distribution
  • \(H_1\): Sample does not follow the normal distribution

For demonstration purpose, we will populate normally-distributed data into an object called \(X\), where we let \(X \sim N(0, 1): n=100\).

set.seed(1)
X <- rnorm(n=100, mean=0, sd=1)

Shapiro-Wilk Test

Shapiro-Wilk is a well-established test to assess (ahem) normality. It can tolerate skewness to a certain degree. Its implementation in R has a constraint: the sample size \(n\) must satisfy \(3 \leqslant n \leqslant 5000\).

shapiro.test(X)
| 
|   Shapiro-Wilk normality test
| 
| data:  X
| W = 1, p-value = 1

Anderson-Darling Test

Compared to Shapiro-Wilk and Kolmogorov-Smirnov, the Anderson-Darling test is less well known. However, it is worth knowing because Anderson-Darling gives more weight to the tails when testing for normality. This means that data with higher frequencies in the tails may be better assessed by the Anderson-Darling test. Its implementation in R requires the nortest package and needs a minimum sample size of \(n = 7\).

nortest::ad.test(X)
| 
|   Anderson-Darling normality test
| 
| data:  X
| A = 0.16, p-value = 0.9

Visual Examination
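A sketch of the usual visual normality check, assuming the object X from above (\(X \sim N(0, 1),\ n = 100\)):

qqnorm(X)             # sample quantiles against theoretical normal quantiles
qqline(X, col="red")  # points hugging the line indicate normality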

\(\chi^2\) and Normal Distribution

We previously mentioned the \(\chi^2\) distributions as special cases of the Gamma distribution. It turns out that if we square standard normally-distributed data, the result behaves as a \(\chi^2\) distribution with 1 degree of freedom. The following example on \(X \sim N(0,1): n=100\) will demonstrate how \(X^2 \sim \chi^2(1)\). We previously tested \(X\) with the Shapiro-Wilk and Anderson-Darling tests to indicate normality.

set.seed(1)
X <- rnorm(n=100, mean=0, sd=1)

First, we raise \(X\) to the power of two.

X2 <- X^2
shapiro.test(X2)
| 
|   Shapiro-Wilk normality test
| 
| data:  X2
| W = 0.7, p-value = 5e-13

Shapiro-Wilk and visual examination suggested \(X^2\) does not follow a normal distribution. But, does it follow a \(\chi^2\) distribution?

ks.test(X2, pchisq, df=1)
| 
|   Asymptotic one-sample Kolmogorov-Smirnov test
| 
| data:  X2
| D = 0.1, p-value = 0.2
| alternative hypothesis: two-sided

So we can safely conclude that squaring a standard normally-distributed variable results in a \(\chi^2\) distribution with parameter \(k=1\).

Central Limit Theorem

So far, we have learnt about probability distributions, where we saw different probability functions for discrete and continuous variables. We are also able to compute the mean and standard deviation from their parameters. It just so happens that the sample mean follows a normal distribution ;) The Central Limit Theorem (CLT) delineates such an occurrence, and it applies to both discrete and continuous distributions.

$$\bar{X} \xrightarrow{d} N \bigg(\mu, \frac{\sigma}{\sqrt{n}} \bigg)\ \text{as}\ n \to \infty$$

The equation above basically tells us that the distribution of the sample mean \(\bar{X}\) converges to a normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\) as the sample size grows. Both \(\mu\) and \(\sigma\) correspond to values computed from the sampled distribution.

The caveat (or the perk) with the CLT is that we need a sufficiently large sample size \(n\). The required \(n\) depends on data skewness, where more skewed data require a higher value of \(n\). We can determine \(n\) using a simple simulation with the following steps (a code sketch follows the list):

  1. Choose any distribution
  2. Generate \(n\) random numbers using specified parameters
  3. Compute the mean and variance based on previous parameters \(\to\) Use it to generate a normal distribution
  4. Reuse the parameters to re-iterate step 2
  5. Conduct the simulation for an arbitrary number of times (e.g. for convenience, 1000)
  6. Calculate mean from all generated data \(\to\) Make a histogram and compare it with step 3
  7. Does not fit normal distribution? \(\to\) Increase \(n\)
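A minimal sketch of this simulation, using an exponential distribution as the (arbitrary) starting point; the sample size, rate and number of iterations are my own illustrative choices:

set.seed(1)
n      <- 30                                              # sample size per iteration (step 2)
lambda <- 2                                               # chosen parameter (step 1)
xbar   <- replicate(1000, mean(rexp(n, rate=lambda)))     # steps 4-6: 1000 sample means
hist(xbar, freq=FALSE, main="Sample means vs. CLT approximation")
curve(dnorm(x, mean=1/lambda, sd=(1/lambda)/sqrt(n)),     # step 3: N(mu, sigma/sqrt(n))
      col="red", add=TRUE)                                # poor fit? increase n (step 7)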

Why should we care about the CLT if we can do statistical inferences without it? In a research setting, we may find differing average values despite following exactly the same procedure. It could be frustrating! Knowing the CLT, we can show that such differences are indeed within expectation. As a final remark, the CLT describes the tendency of a mean to follow a normal distribution, but it requires a sufficiently large \(n\).