count: false class: bg-main1 split-70 hide-slide-number .column[.vmiddle.right.content[ .font3[.amber[Correlation] of Numeric Variables] ]] .column.bg-main4[.vmiddle.content[ .amber[Aly Lamuri] Indonesia Medical Education and Research Institute ]] --- name: overview layout: true class: bg-main4 split-30 hide-slide-number .column[.vmiddle.right.content[ .font3.amber[Overview] ]] --- template: overview count: false .column.bg-main1[.vmiddle.content[ - .amber[Covariance] - Pearson's `\(r\)` - Spearman's `\(\rho\)` - Kendall's `\(\tau\)` ]] --- layout: true class: bg-main3 # Covariance --- .font2[ - Concept recall: variance - Describes a trend between two .amber[numeric] variables - Does not define the magnitude - How does `\(y\)` behave if we know the value of `\(x\)`? ] --- count: false .font2[ `$$\sigma_{x, y} = \frac{\displaystyle \sum_{i=1}^n(x_i - \mu_x)(y_i - \mu_y)}{n}$$` ] ??? - Concept recall: Bias and Bessel's correction -- .font2[ `$$s_{x, y} = \frac{\displaystyle \sum_{i=1}^n(x_i - \color{orange}{\bar{x}}) (y_i - \color{orange}{\bar{y}})}{(\color{orange}{n-1})}$$` ] --- layout: false class: bg-main3 # Covariance matrix .font2[ - Pairwise relationships between multiple numeric variables - Assessing trends at a glimpse - A useful descriptive statistics before designing a complex model ] --- layout: true class: bg-main3 # Example, please? --- ```r tbl <- subset(iris, select=c(Sepal.Width, Sepal.Length)) %>% str() ``` ``` ## 'data.frame': 150 obs. of 2 variables: ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ``` -- .font2[ - We will calculate how `Sepal.Width` covary with `Sepal.Length` - From here onwards, we will set `x` to represent the width - ...and `y` to represent the length ] --- count: false .bg-white[ <br>
<br> ] --- ```r covariance <- function(x, y) { n <- length(x) # Length of x must be = length of y {(x - mean(x)) * (y - mean(y))} %>% sum() %>% divide_by(n-1) } ``` .font2[ - This function will help us calculating the covariance - Notice how it forms a computational sequence? ] -- ```r covariance(tbl$x, tbl$y) ``` ``` ## [1] -0.042 ``` ```r cov(tbl$x, tbl$y) # Built-in function ``` ``` ## [1] -0.042 ``` --- .font2[How if we calculate covariances of the same variable?] -- ```r covariance(tbl$x, tbl$x) ``` ``` ## [1] 0.69 ``` ```r var(tbl$x) # Variance of x ``` ``` ## [1] 0.69 ``` ??? - Covariance of one variable is the **variance** -- .font2[ `$$s_{x, x} = \frac{\displaystyle \sum_{i=1}^n(x_i - \color{orange}{\bar{x}}) (x_i - \color{orange}{\bar{x}})}{(\color{orange}{n-1})}$$` ] --- ```r tbl <- subset(iris, select=-Species) %T>% str() ``` ``` ## 'data.frame': 150 obs. of 4 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ``` ```r cov(tbl) ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Sepal.Length 0.686 -0.042 1.27 0.52 ## Sepal.Width -0.042 0.190 -0.33 -0.12 ## Petal.Length 1.274 -0.330 3.12 1.30 ## Petal.Width 0.516 -0.122 1.30 0.58 ``` ??? - This provides a splendid example on covariance matrix --- template: overview count: false .column.bg-main1[.vmiddle.content[ - Covariance - .amber[Pearson's] `\(r\)` - Spearman's `\(\rho\)` - Kendall's `\(\tau\)` ]] --- layout: true class: bg-main3 # Pearson's `\(r\)` --- .font2[ - Moment product correlation - Describes the trend - Also the .amber[magnitude] - Dimension free ] --- count: false .font2[ `\begin{align} r &= \frac{\color{orange}{s_{x,y}}}{s_x \cdot s_y} \\ &= \displaystyle \sum_{i=1}^n \frac{\color{orange}{(x-\bar{x}) (y-\bar{y})}} {\color{orange}{(n-1)} \cdot s_x \cdot s_y} \\ &= \displaystyle \sum_{i=1}^n \frac{\big( \frac{x-\bar{x}}{s_x} \big) \cdot \big( \frac{y-\bar{y}}{s_y} \big)}{n-1} \end{align}` ] ??? - Concept recall: Z-score --- .font2[ `\begin{align} r &= \frac{Z_x \cdot Z_y}{n-1} \\ \nu &= n - 2 \tag{DoF} \end{align}` ] -- .font2[ - It describes the relationship between .amber[two] numeric variables - Both variables needs to follow a normal distribution - Recall: `\(Z \sim N(0, 1)\)` - Since `\(r \sim Z \to r\)` does not care for the unit! ] --- .font2[ `$$t = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}}$$` - `\(t \sim T(\nu)\)` - There exists another method of determining the significance ] --- layout: false class: bg-main3 # Assumptions .font2[ - I.I.D - Univariate normality - .amber[Bivariate] normality - Has a linear relationship ] ??? - Important concept: joint distribution - When the data follows a bivariate normal distribution, Pearson's `\(r\)` can completely describe the relationship - However, bivariate normality is not a stringent assumption per se - Could not address non-linearity -- ## Hypotheses .font2[ - `\(H_0\)`: Both variables do not have a linear relationship - `\(H_1\)`: Both variables have a linear relationship ] --- layout: true class: bg-main3 # Example, please? --- ```r lapply(tbl, shapiro.test) %>% lapply(broom::tidy) %>% lapply(data.frame) %>% {do.call(rbind, .)} %>% kable() %>% kable_minimal() ``` <table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> statistic </th> <th style="text-align:right;"> p.value </th> <th style="text-align:left;"> method </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Sepal.Length </td> <td style="text-align:right;"> 0.98 </td> <td style="text-align:right;"> 0.01 </td> <td style="text-align:left;"> Shapiro-Wilk normality test </td> </tr> <tr> <td style="text-align:left;"> Sepal.Width </td> <td style="text-align:right;"> 0.98 </td> <td style="text-align:right;"> 0.10 </td> <td style="text-align:left;"> Shapiro-Wilk normality test </td> </tr> <tr> <td style="text-align:left;"> Petal.Length </td> <td style="text-align:right;"> 0.88 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:left;"> Shapiro-Wilk normality test </td> </tr> <tr> <td style="text-align:left;"> Petal.Width </td> <td style="text-align:right;"> 0.90 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:left;"> Shapiro-Wilk normality test </td> </tr> </tbody> </table> --- count: false <img src="index_files/figure-html/pearson2-1.png" width="90%" /> ??? - Sepal width follows a normal distribution - Sepal length *closely* follow a normal distribution - Not many normality violations in sepal length (checked using qqplot) - We shall see whether our data follow a bivariate normal distribution --- count: false ```r subset(tbl, select=c(Sepal.Length, Sepal.Width)) %>% MVN::mvn() # Multivariate normality ``` ``` ## $multivariateNormality ## Test Statistic p value Result ## 1 Mardia Skewness 9.46144098216623 0.0505456076692465 YES ## 2 Mardia Kurtosis -0.853178029438543 0.393560585232763 YES ## 3 MVN <NA> <NA> YES ## ## $univariateNormality ## Test Variable Statistic p value Normality ## 1 Shapiro-Wilk Sepal.Length 0.98 0.01 NO ## 2 Shapiro-Wilk Sepal.Width 0.98 0.10 YES ## ## $Descriptives ## n Mean Std.Dev Median Min Max 25th 75th Skew Kurtosis ## Sepal.Length 150 5.8 0.83 5.8 4.3 7.9 5.1 6.4 0.31 -0.61 ## Sepal.Width 150 3.1 0.44 3.0 2.0 4.4 2.8 3.3 0.31 0.14 ``` ??? - Multivariate normality test is a general form of measuring bivariate normality - We use Mardia's test for this purpose --- ```r cor.test(tbl$Sepal.Length, tbl$Sepal.Width) ``` ``` ## ## Pearson's product-moment correlation ## ## data: tbl$Sepal.Length and tbl$Sepal.Width ## t = -1, df = 148, p-value = 0.2 ## alternative hypothesis: true correlation is not equal to 0 ## 95 percent confidence interval: ## -0.273 0.044 ## sample estimates: ## cor ## -0.12 ``` -- .font2[ - `\(\color{red}{-1} \leq r \leq \color{orange}{1}\)` - .red[Negative] and .orange[positive] trends ] --- count: false <img src="index_files/figure-html/pearson4-1.png" width="90%" /> --- template: overview count: false .column.bg-main1[.vmiddle.content[ - Covariance - Pearson's `\(r\)` - .amber[Spearman's] `\(\rho\)` - Kendall's `\(\tau\)` ]] --- layout: true class: bg-main3 # Spearman's `\(\rho\)` --- .font2[ - A non-parametric variant of Pearson's `\(r\)` - Suitable to handle ordinal data - In some cases: applicable for non-normally distributed numeric data - Not sufficient to correctly handle tied values ] --- count: false .font2[ `\begin{align} \rho &= 1 - \frac{6 \sum (R_x - R_y)^2}{n (n^2 - 1)} \\ \nu &= n - 2 \tag{DoF} \end{align}` ] -- .font2[ - `\(R_{x,y}\)` is the rank for `\(X, Y\)` - Ranking follows an order within one variable, i.e. .amber[not] by pooling the data - By assigning rank, we can address non-linearity to a certain degree ] ??? - As an alternative to this equation, we can use Pearson's `\(r\)` - But we need to use the rank instead of the actual data element --- count: false .font2[ `$$t = \frac{\rho}{\sqrt{\frac{1-\rho^2}{n-2}}}$$` - `\(t \sim T(\nu)\)` - Handle ties by taking the average value of ranks - Tie `\(\to\)` Has little confidence in determining the p-value ] --- layout: false class: bg-main3 # Assumptions .font2[ - I.I.D - Monotonic trend - Has a natural order ] --- layout: true class: bg-main3 # Example, please? --- ## .red[Disclaimer!] .font2[ - This example is only for an illustrative purpose - We will re-use a subset on the `iris` dataset ] --- count: false ```r tbl <- subset(iris, select=-Species) %T>% str() ``` ``` ## 'data.frame': 150 obs. of 4 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ``` ```r cov(tbl) ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Sepal.Length 0.686 -0.042 1.27 0.52 ## Sepal.Width -0.042 0.190 -0.33 -0.12 ## Petal.Length 1.274 -0.330 3.12 1.30 ## Petal.Width 0.516 -0.122 1.30 0.58 ``` --- count: false ```r cor.test(tbl$Sepal.Length, tbl$Sepal.Width, method="spearman") ``` ``` ## ## Spearman's rank correlation rho ## ## data: tbl$Sepal.Length and tbl$Sepal.Width ## S = 7e+05, p-value = 0.04 ## alternative hypothesis: true rho is not equal to 0 ## sample estimates: ## rho ## -0.17 ``` -- .font2[ - `\(\color{red}{-1} \leq \rho \leq \color{orange}{1}\)` - .red[Negative] and .orange[positive] trends ] --- template: overview count: false .column.bg-main1[.vmiddle.content[ - Covariance - Pearson's `\(r\)` - Spearman's `\(\rho\)` - .amber[Kendall's] `\(\tau\)` ]] --- layout: true class: bg-main3 # Kendall's `\(\tau\)` --- .font2[ - Non-parametric - Methods: `\(\tau_a, \tau_b, \tau_c\)` - Concordant and discordant pairs ] ??? - `\(\tau_a\)`: Square table - `\(\tau_b\)`: Square table, handles tie - `\(\tau_c\)`: Rectangular table, handles tie - Most applicable on an ordinal data --- count: false .font2[ - For `\(i, j \in X, Y: i \neq j,\ \exists\ (x_{i, j}, y_{i, j})\)` - Concordant: `\((x_i < x_j \ \texttt{and}\ y_i < y_j) \lor (x_i > x_j \ \texttt{and}\ y_i > y_j)\)` - Discordant: `\((x_i < x_j \ \texttt{and}\ y_i \nless y_j) \lor (x_i > x_j \ \texttt{and}\ y_i \ngtr y_j)\)` ] ??? - Concordant: pairs with similar symbols - Discordant: pairs with dissimilar symbols --- count: false .font2[ `\begin{align} \tau_a &= \frac{n_c - n_d}{n}\\ \tau_b &= \frac{n_c - n_d}{\sqrt{(n + X_0) (n + Y_0)}} \\ \tau_c &= \frac{2(n_c - n_d)}{n^2 \frac{(m-1)}{m}} \\ n &= \binom{n}{2} \end{align}` ] ??? - Square table: both variables are ordinal with the same scale - Rectangular table: both variables have different measurement scales - `\(n_c\)`: Number of concordant pairs - `\(n_d\)`: Number of discordant pairs - `\(n\)`: Total number of possible pairs - `\(m\)`: `\(min(r, c): r\)` is the row and `\(c\)` is the column - `\(X_0, Y_0\)`: Ties in either X or Y - Most statistical software employs Kendall's `\(\tau_b\)` --- layout: true class: bg-main3 # Example, please? --- ```r tbl <- subset(iris, select=-Species) %T>% str() ``` ``` ## 'data.frame': 150 obs. of 4 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ``` ```r cov(tbl) ``` ``` ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Sepal.Length 0.686 -0.042 1.27 0.52 ## Sepal.Width -0.042 0.190 -0.33 -0.12 ## Petal.Length 1.274 -0.330 3.12 1.30 ## Petal.Width 0.516 -0.122 1.30 0.58 ``` --- count: false ```r cor.test(tbl$Sepal.Length, tbl$Sepal.Width, method="kendall") ``` ``` ## ## Kendall's rank correlation tau ## ## data: tbl$Sepal.Length and tbl$Sepal.Width ## z = -1, p-value = 0.2 ## alternative hypothesis: true tau is not equal to 0 ## sample estimates: ## tau ## -0.077 ``` -- .font2[ - `\(\color{orange}{0} \leq \tau \leq \color{orange}{1}\)` - Interpret the absolute value of `\(\tau\)` - Base `R` only implements `\(\tau_a\)`, other methods exist in a specific packages ] --- layout: false class: bg-main2 # Recap .font2[ - Check normality - Check linearity - Non-parametric test: determine the presence of tie - Perform correlation - Create the plot (if necessary) ] --- class: bg-main2 # Caveats .font2[ - We only discussed .orange[*some*] of the popular correlation test - All discussed methods assume .orange[I.I.D] - .orange[Paired data] is suitable for .orange[none] of discussed methods - .orange[Time series] data requires a different approach - Correlation `\(\color{orange}{\neq}\)` Causation ] --- class: bg-main2 # Is that all? ??? Short answer: no. -- .font2[ - Concordance correlation coefficient - Intraclass correlation - Partial correlation - Zero-order correlation - The list goes on... ] --- count: false class: bg-main1 middle center font5 hide-slide-number .amber[Query?]