
Population and Parameters

We often regard the society we live in as a population, but what exactly defines a population? In layman's terms, we might describe a population as all observable subjects inhabiting a certain location. In a research context this is far from a perfect definition, but we can generalize the idea of “subjects” into any entities of interest. Parameters are quantitative summaries: single values we derive to describe certain aspects of such a population.

However, collecting data from an entire population is an arduous task. We are often left with the option of gathering information only from a subset of the population, i.e. a sample. From a sample we compute statistics as approximations of the corresponding parameters, which we will convey in further sections. Both statistics and parameters may include, but are not limited to: the data elements, the number of elements, proportion, median, average, standard deviation, variance and correlation coefficient.

| Statistic   | Meaning                 | Parameter    |
|-------------|-------------------------|--------------|
| \(x\)       | Data element            | \(X\)        |
| \(n\)       | Number of elements      | \(N\)        |
| \(p\)       | Proportion              | \(P\)        |
| \(m\)       | Median                  | \(M\)        |
| \(\bar{x}\) | Average                 | \(\mu\)      |
| \(s\)       | Standard deviation      | \(\sigma\)   |
| \(s^2\)     | Variance                | \(\sigma^2\) |
| \(r\)       | Correlation coefficient | \(\rho\)     |

Table: Differences between parameters and statistics
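
To make the distinction concrete, here is a minimal sketch (purely illustrative; the population size, seed and sample size are arbitrary assumptions): we simulate a large population of heights, then draw a sample and compare the parameter \(\mu\) with the statistic \(\bar{x}\).

set.seed(13)                                  # arbitrary seed for reproducibility
population <- rnorm(10000, mean=160, sd=10)   # pretend this is the whole population
smpl <- sample(population, 30)                # a sample of 30 subjects
mean(population)                              # the parameter (mu), known only because we simulated it
mean(smpl)                                    # the statistic (x-bar), our approximation of mu

Both means should land close to, but not exactly on, 160; the sample mean merely approximates the population mean.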

Data Element

Suppose we have height data from a particular group of students, which we refer to as \(X\). Our data comprises an array of many elements, \(X = \{x_1, x_2, ..., x_n\}\). As denoted by the indices, each of \(x_1, x_2, ..., x_n\) is an element of the data. To make this concrete, we will generate a set of random numbers to aid further explanation.

set.seed(1)
X <- rnorm(10, mean=160, sd=10) # simulate 10 student heights in cm
print(X)
|  [1] 153.7 161.8 151.6 176.0 163.3 151.8 164.9 167.4 165.8 156.9

We can refer to each data point by its index. If we were to extract a particular element, e.g. the 7th element of \(X\), we can do so by referring to \(x_7\), as such:

print(X[7])
| [1] 164.9

Understanding what data are and how we represent data as a mathematical array is an important step in comprehending how statistics works in general.

Number of Elements

The previous section introduced us to the elements of the data. We now understand \(x_{1...n} \in X\), which translates to \(x\) with indices \(\{1, 2, ..., n\}\) being elements of \(X\). We already know what \(x\) and \(X\) are, but what about \(n\)? It turns out \(n\) is the total number of elements. In R, we can find the exact number of elements by running the following code:

length(X)
| [1] 10

Proportion

Knowing the total number of elements, we can infer something more exciting from our data. Suppose we set a threshold of 165 cm, where a subject taller than 165 cm is considered tall. We can then find the proportion of tall students compared to (ahem) not-so-tall students. We simply regard \(p = \frac{g}{n}\), where:
\(p\): Proportion
\(g\): Number of elements in the group of interest
\(n\): Total number of elements

So, how many tall students do we have in our data?

set.seed(1)
X <- rnorm(10, mean=160, sd=10)
print(X)
|  [1] 153.7 161.8 151.6 176.0 163.3 151.8 164.9 167.4 165.8 156.9
sum(X > 165) / length(X) # g / n
| [1] 0.3
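
To see which observations make up the group of interest \(g\), here is a quick check using the same 165 cm threshold:

X[X > 165]     # the heights counted as tall
sum(X > 165)   # g, the size of the group of interest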

Mean

In calculating the average value, we assume the data follow a roughly even (symmetric) distribution, meaning the data should have a roughly equal number of observations on both extreme ends (tails). We can then calculate the average value as a general description of the given data using the following equation:

$$\bar{x} = \frac{1}{n}\displaystyle \sum_{i=1}^{n} x_i$$

Of course, we can compute the mean manually in R by running:

sum(X) / length(X)
| [1] 161.3

However, R has a nice built-in function to perform the calculation on our behalf. To calculate the mean, we use (wait for it) mean.

mean(X)
| [1] 161.3

The problem is, not all data are distributed evenly. A quick glance at the histogram and density plot of our data reveals the following figure:
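
A minimal base-R sketch that would produce a comparable histogram with a density overlay (assuming the same X as before; the original figure may have been generated differently):

hist(X, freq=FALSE, main="Distribution of X", xlab="Height (cm)")  # probability density scale
lines(density(X), lwd=2)                                           # overlay a kernel density estimate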

As we can observe, some bins in the histogram present a higher probability density than others. We can interpret this as the data not following an even distribution. As a solution, we may need another measure to describe the general property of our data, i.e. the median.

Median

In finding the median, we first need to sort the data, either in ascending or descending order. We usually sort in ascending order, since it is the default behaviour of the sort function and is easier to read. After sorting the data, we find the middle element, which we regard as our median. Mathematically, we can formulate the median as:

$$m = \begin{cases} x_{\frac{n+1}{2}} & \text{if } 2 \nmid n \text{ (odd } n\text{)} \\ \frac{1}{2} \left(x_{\frac{n}{2}} + x_{\frac{n}{2} + 1} \right) & \text{if } 2 \mid n \text{ (even } n\text{)} \end{cases}$$

To visualize the process, let us first sort the data:

sort(X)
|  [1] 151.6 151.8 153.7 156.9 161.8 163.3 164.9 165.8 167.4 176.0

Then we compute the median using the built-in function:

median(X)
| [1] 162.6
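
To connect the piecewise formula to the built-in function, here is a minimal manual sketch (assuming the same X as before); with our even-sized data it averages the two middle elements and should agree with median(X):

x.sorted <- sort(X)
n <- length(x.sorted)
if (n %% 2 == 1) {
  m <- x.sorted[(n + 1) / 2]                 # odd n: take the middle element
} else {
  m <- mean(x.sorted[c(n / 2, n / 2 + 1)])   # even n: average the two middle elements
}
print(m)                                     # matches median(X) above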

Standard Deviation

In simple terms, the deviation tells us how far a data element \(x_i\) is from the mean value \(\mu\). So, we can calculate the parametric deviation as \(d_i = x_i - \mu\). However, the deviation can be either negative or positive, as shown in the following demonstration.

set.seed(1)
X <- rnorm(10, mean=160, sd=10)
print(X)
|  [1] 153.7 161.8 151.6 176.0 163.3 151.8 164.9 167.4 165.8 156.9
d <- X - mean(X)
print(d, digits=2)
|  [1] -7.59  0.51 -9.68 14.63  1.97 -9.53  3.55  6.06  4.44 -4.38

The presence of both negative and positive values makes it harder to summarize them: deviations from the mean always sum to zero, so taking their mean (or median) would not make sense. What we need is an absolute distance as our value of deviation. We can solve the issue by taking only the absolute values:

$$\bar{d} = \frac{1}{N} \displaystyle \sum_{i=1}^{N} |X_i - \mu|$$

So, we may re-calculate the deviation as:

d <- abs(X - mean(X))
print(d, digits=2)
|  [1]  7.59  0.51  9.68 14.63  1.97  9.53  3.55  6.06  4.44  4.38
d.bar <- mean(d)
print(d.bar, digits=2)
| [1] 6.2

Now, it’s easier to report our findings as \(\bar{x} \pm \bar{d}\), or numerically as 161.32 \(\pm\) 6.23. Yet, such a practice is uncommon to see.

A more robust alternative for calculating deviation is to take the root mean square of the deviations, which defines the standard deviation:

$$\sigma = \sqrt{\frac{1}{N} \displaystyle \sum_{i=1}^N (X_i - \mu)^2}$$

Transforming the equation into R code, we obtain the following value:

std.dev <- sqrt(sum({X - mean(X)}^2) / length(X))
print(std.dev)
| [1] 7.4

Please be advised that thus far we have only calculated the parametric (population) value of the standard deviation. Since we only have a sample, we need to estimate the actual value while keeping bias to a minimum. To do so, we apply Bessel's correction. In simple terms, Bessel's correction divides by \(n - 1\) instead of \(N\), which produces:

$$s = \sqrt{\frac{1}{n-1} \displaystyle \sum_{i=1}^n (x_i - \bar{x})^2}$$

Quickly running the code in our R session, we obtain the following result:

std.dev <- sqrt(sum({X - mean(X)}^2) / {length(X) - 1})
print(std.dev)
| [1] 7.8
sd(X) # Built-in function to calculate standard deviation
| [1] 7.8

Please notice the slight difference from the value obtained prior to applying Bessel's correction.

Variance

Variance is another measure of deviation. Akin to the standard deviation, the variance denotes the mean squared difference between each data point and the mean value. However, it does not take the square root, and is thus denoted \(\sigma^2\). To obtain the variance of sample data, we may use the following equation:

$$s^2 = \frac{1}{n-1} \displaystyle \sum_{i=1}^n (x_i - \bar{x})^2$$
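
As a quick check in R (assuming the same X simulated earlier), the manual calculation should agree with the built-in var function and equal the square of sd(X):

sum((X - mean(X))^2) / (length(X) - 1)   # manual sample variance
var(X)                                   # built-in equivalent
sd(X)^2                                  # squared standard deviation gives the same value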

Finding the variance is important for making further inferences in more advanced statistical models, e.g. analysis of variance (ANOVA) or factor analysis. Such topics are beyond the scope of the current post, so we will look at them on a future occasion.

Quantile

Quantiles are another measure of spread, where we assign each data element to a corresponding group. A quantile creates cut-off points from a given probability distribution and divides the data into contiguous ranges. This method requires sorted data, since it segregates the data based on their order.

Our ordered data:

sort(X)
|  [1] 152 152 154 157 162 163 165 166 167 176

If we were to separate it into quantiles with steps of 0.2, we would get the following cut-offs:

quantile(X, probs=seq(0, 1, 1/5))
|   0%  20%  40%  60%  80% 100% 
|  152  153  160  164  166  176

A quantile can take any value \(q \in \mathbb{R}\) with \(0 \leqslant q \leqslant 1\). The quartile is a special case of the quantile, which uses \(q = \{0.25, 0.5, 0.75\}\) to group the data.

quantile(X, probs=seq(0, 1, 1/4))
|   0%  25%  50%  75% 100% 
|  152  155  163  166  176

The quartile is an important measure in descriptive statistics. When reporting the mean, we usually state \(\bar{x} \pm s\). Meanwhile, we may report the median alongside its interquartile range as \(m\ (Q_1 - Q_3)\), where \(Q_1\) stands for the first quartile (quantile 0.25) and \(Q_3\) for the third quartile (quantile 0.75).
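
As an illustration of how such summaries might be written out (one hypothetical formatting choice, not a reporting standard), assuming the same X:

q <- quantile(X, probs=c(0.25, 0.5, 0.75))
sprintf("Mean: %.1f \u00b1 %.1f", mean(X), sd(X))        # x-bar plus-minus s
sprintf("Median: %.1f (%.1f - %.1f)", q[2], q[1], q[3])  # m (Q1 - Q3)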

Conclusion

We can describe the data by looking into its central tendency and spread. Mean and median are common measures of central tendency, while standard deviation, variance and quantiles can be used to measure the spread.