Correlation of Numeric Variables

.column.bg-main4[.vmiddle.content[
.amber[Aly Lamuri]  
Indonesia Medical Education and Research Institute
]]

---

---

.column.bg-main1[.vmiddle.content[
- .amber[Covariance]
- Pearson's `$r$`
- Spearman's `$\rho$`
- Kendall's `$\tau$`
]]

---

# Covariance

---

.font2[
- Concept recall: variance
- Describes a trend between two .amber[numeric] variables
- Does not define the magnitude
- How does `$y$` behave if we know the value of `$x$`?
]

---

???

- Concept recall: Bias and Bessel's correction

.font2[
`$$s_{x, y} = \frac{\displaystyle \sum_{i=1}^n(x_i - \color{orange}{\bar{x}}) (y_i - \color{orange}{\bar{y}})}{(\color{orange}{n-1})}$$`
]

---

# Covariance matrix

.font2[
- Pairwise relationships between multiple numeric variables
- Assessing trends at a glimpse
- A useful descriptive statistics before designing a complex model
]

---

# Example, please?

---

```r
tbl <- subset(iris, select=c(Sepal.Width, Sepal.Length)) %>% str()
```

```
## 'data.frame':	150 obs. of  2 variables:
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
```

.font2[
- We will calculate how `Sepal.Width` covary with `Sepal.Length`
- From here onwards, we will set `x` to represent the width
- ...and `y` to represent the length
]

---

<div id="htmlwidget-c46c38adc5ed980156ff" style="width:100%;height:auto;" class="datatables html-widget"></div>
<script type="application/json" data-for="htmlwidget-c46c38adc5ed980156ff">{"x":{"filter":"none","data":[["1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50","51","52","53","54","55","56","57","58","59","60","61","62","63","64","65","66","67","68","69","70","71","72","73","74","75","76","77","78","79","80","81","82","83","84","85","86","87","88","89","90","91","92","93","94","95","96","97","98","99","100","101","102","103","104","105","106","107","108","109","110","111","112","113","114","115","116","117","118","119","120","121","122","123","124","125","126","127","128","129","130","131","132","133","134","135","136","137","138","139","140","141","142","143","144","145","146","147","148","149","150"],[5.1,4.9,4.7,4.6,5,5.4,4.6,5,4.4,4.9,5.4,4.8,4.8,4.3,5.8,5.7,5.4,5.1,5.7,5.1,5.4,5.1,4.6,5.1,4.8,5,5,5.2,5.2,4.7,4.8,5.4,5.2,5.5,4.9,5,5.5,4.9,4.4,5.1,5,4.5,4.4,5,5.1,4.8,5.1,4.6,5.3,5,7,6.4,6.9,5.5,6.5,5.7,6.3,4.9,6.6,5.2,5,5.9,6,6.1,5.6,6.7,5.6,5.8,6.2,5.6,5.9,6.1,6.3,6.1,6.4,6.6,6.8,6.7,6,5.7,5.5,5.5,5.8,6,5.4,6,6.7,6.3,5.6,5.5,5.5,6.1,5.8,5,5.6,5.7,5.7,6.2,5.1,5.7,6.3,5.8,7.1,6.3,6.5,7.6,4.9,7.3,6.7,7.2,6.5,6.4,6.8,5.7,5.8,6.4,6.5,7.7,7.7,6,6.9,5.6,7.7,6.3,6.7,7.2,6.2,6.1,6.4,7.2,7.4,7.9,6.4,6.3,6.1,7.7,6.3,6.4,6,6.9,6.7,6.9,5.8,6.8,6.7,6.7,6.3,6.5,6.2,5.9],[3.5,3,3.2,3.1,3.6,3.9,3.4,3.4,2.9,3.1,3.7,3.4,3,3,4,4.4,3.9,3.5,3.8,3.8,3.4,3.7,3.6,3.3,3.4,3,3.4,3.5,3.4,3.2,3.1,3.4,4.1,4.2,3.1,3.2,3.5,3.6,3,3.4,3.5,2.3,3.2,3.5,3.8,3,3.8,3.2,3.7,3.3,3.2,3.2,3.1,2.3,2.8,2.8,3.3,2.4,2.9,2.7,2,3,2.2,2.9,2.9,3.1,3,2.7,2.2,2.5,3.2,2.8,2.5,2.8,2.9,3,2.8,3,2.9,2.6,2.4,2.4,2.7,2.7,3,3.4,3.1,2.3,3,2.5,2.6,3,2.6,2.3,2.7,3,2.9,2.9,2.5,2.8,3.3,2.7,3,2.9,3,3,2.5,2.9,2.5,3.6,3.2,2.7,3,2.5,2.8,3.2,3,3.8,2.6,2.2,3.2,2.8,2.8,2.7,3.3,3.2,2.8,3,2.8,3,2.8,3.8,2.8,2.8,2.6,3,3.4,3.1,3,3.1,3.1,3.1,2.7,3.2,3.3,3,2.5,3,3.4,3],[-0.743333333333334,-0.943333333333333,-1.14333333333333,-1.24333333333333,-0.843333333333334,-0.443333333333333,-1.24333333333333,-0.843333333333334,-1.44333333333333,-0.943333333333333,-0.443333333333333,-1.04333333333333,-1.04333333333333,-1.54333333333333,-0.0433333333333339,-0.143333333333334,-0.443333333333333,-0.743333333333334,-0.143333333333334,-0.743333333333334,-0.443333333333333,-0.743333333333334,-1.24333333333333,-0.743333333333334,-1.04333333333333,-0.843333333333334,-0.843333333333334,-0.643333333333334,-0.643333333333334,-1.14333333333333,-1.04333333333333,-0.443333333333333,-0.643333333333334,-0.343333333333334,-0.943333333333333,-0.843333333333334,-0.343333333333334,-0.943333333333333,-1.44333333333333,-0.743333333333334,-0.843333333333334,-1.34333333333333,-1.44333333333333,-0.843333333333334,-0.743333333333334,-1.04333333333333,-0.743333333333334,-1.24333333333333,-0.543333333333334,-0.843333333333334,1.15666666666667,0.556666666666667,1.05666666666667,-0.343333333333334,0.656666666666666,-0.143333333333334,0.456666666666666,-0.943333333333333,0.756666666666666,-0.643333333333334,-0.843333333333334,0.0566666666666666,0.156666666666666,0.256666666666666,-0.243333333333334,0.856666666666666,-0.243333333333334,-0.0433333333333339,0.356666666666666,-0.243333333333334,0.0566666666666666,0.256666666666666,0.456666666666666,0.256666666666666,0.556666666666667,0.756666666666666,0.956666666666666,0.856666666666666,0.156666666666666,-0.143333333333334,-0.343333333333334,-0.343333333333334,-0.0433333333333339,0.156666666666666,-0.443333333333333,0.156666666666666,0.856666666666666,0.456666666666666,-0.243333333333334,-0.343333333333334,-0.343333333333334,0.256666666666666,-0.0433333333333339,-0.843333333333334,-0.243333333333334,-0.143333333333334,-0.143333333333334,0.356666666666666,-0.743333333333334,-0.143333333333334,0.456666666666666,-0.0433333333333339,1.25666666666667,0.456666666666666,0.656666666666666,1.75666666666667,-0.943333333333333,1.45666666666667,0.856666666666666,1.35666666666667,0.656666666666666,0.556666666666667,0.956666666666666,-0.143333333333334,-0.0433333333333339,0.556666666666667,0.656666666666666,1.85666666666667,1.85666666666667,0.156666666666666,1.05666666666667,-0.243333333333334,1.85666666666667,0.456666666666666,0.856666666666666,1.35666666666667,0.356666666666666,0.256666666666666,0.556666666666667,1.35666666666667,1.55666666666667,2.05666666666667,0.556666666666667,0.456666666666666,0.256666666666666,1.85666666666667,0.456666666666666,0.556666666666667,0.156666666666666,1.05666666666667,0.856666666666666,1.05666666666667,-0.0433333333333339,0.956666666666666,0.856666666666666,0.856666666666666,0.456666666666666,0.656666666666666,0.356666666666666,0.0566666666666666],[0.442666666666667,-0.0573333333333332,0.142666666666667,0.0426666666666669,0.542666666666667,0.842666666666667,0.342666666666667,0.342666666666667,-0.157333333333333,0.0426666666666669,0.642666666666667,0.342666666666667,-0.0573333333333332,-0.0573333333333332,0.942666666666667,1.34266666666667,0.842666666666667,0.442666666666667,0.742666666666667,0.742666666666667,0.342666666666667,0.642666666666667,0.542666666666667,0.242666666666667,0.342666666666667,-0.0573333333333332,0.342666666666667,0.442666666666667,0.342666666666667,0.142666666666667,0.0426666666666669,0.342666666666667,1.04266666666667,1.14266666666667,0.0426666666666669,0.142666666666667,0.442666666666667,0.542666666666667,-0.0573333333333332,0.342666666666667,0.442666666666667,-0.757333333333333,0.142666666666667,0.442666666666667,0.742666666666667,-0.0573333333333332,0.742666666666667,0.142666666666667,0.642666666666667,0.242666666666667,0.142666666666667,0.142666666666667,0.0426666666666669,-0.757333333333333,-0.257333333333333,-0.257333333333333,0.242666666666667,-0.657333333333333,-0.157333333333333,-0.357333333333333,-1.05733333333333,-0.0573333333333332,-0.857333333333333,-0.157333333333333,-0.157333333333333,0.0426666666666669,-0.0573333333333332,-0.357333333333333,-0.857333333333333,-0.557333333333333,0.142666666666667,-0.257333333333333,-0.557333333333333,-0.257333333333333,-0.157333333333333,-0.0573333333333332,-0.257333333333333,-0.0573333333333332,-0.157333333333333,-0.457333333333333,-0.657333333333333,-0.657333333333333,-0.357333333333333,-0.357333333333333,-0.0573333333333332,0.342666666666667,0.0426666666666669,-0.757333333333333,-0.0573333333333332,-0.557333333333333,-0.457333333333333,-0.0573333333333332,-0.457333333333333,-0.757333333333333,-0.357333333333333,-0.0573333333333332,-0.157333333333333,-0.157333333333333,-0.557333333333333,-0.257333333333333,0.242666666666667,-0.357333333333333,-0.0573333333333332,-0.157333333333333,-0.0573333333333332,-0.0573333333333332,-0.557333333333333,-0.157333333333333,-0.557333333333333,0.542666666666667,0.142666666666667,-0.357333333333333,-0.0573333333333332,-0.557333333333333,-0.257333333333333,0.142666666666667,-0.0573333333333332,0.742666666666667,-0.457333333333333,-0.857333333333333,0.142666666666667,-0.257333333333333,-0.257333333333333,-0.357333333333333,0.242666666666667,0.142666666666667,-0.257333333333333,-0.0573333333333332,-0.257333333333333,-0.0573333333333332,-0.257333333333333,0.742666666666667,-0.257333333333333,-0.257333333333333,-0.457333333333333,-0.0573333333333332,0.342666666666667,0.0426666666666669,-0.0573333333333332,0.0426666666666669,0.0426666666666669,0.0426666666666669,-0.357333333333333,0.142666666666667,0.242666666666667,-0.0573333333333332,-0.557333333333333,-0.0573333333333332,0.342666666666667,-0.0573333333333332]],"container":"<table class=\"display\">\n  <thead>\n    <tr>\n      <th> <\/th>\n      <th>x<\/th>\n      <th>y<\/th>\n      <th>x.resid<\/th>\n      <th>y.resid<\/th>\n    <\/tr>\n  <\/thead>\n<\/table>","options":{"columnDefs":[{"className":"dt-right","targets":[1,2,3,4]},{"orderable":false,"targets":0}],"order":[],"autoWidth":false,"orderClasses":false}},"evals":[],"jsHooks":[]}</script>

<br>
]

---

```r
covariance <- function(x, y) {
	n <- length(x) # Length of x must be = length of y
	{(x - mean(x)) * (y - mean(y))} %>% sum() %>% divide_by(n-1)
}
```

.font2[
- This function will help us calculating the covariance
- Notice how it forms a computational sequence?
]

```r
covariance(tbl$x, tbl$y)
```

```
## [1] -0.042
```

```r
cov(tbl$x, tbl$y) # Built-in function
```

```
## [1] -0.042
```

---

```r
covariance(tbl$x, tbl$x)
```

```
## [1] 0.69
```

```r
var(tbl$x) # Variance of x
```

```
## [1] 0.69
```

???

- Covariance of one variable is the **variance**

.font2[
`$$s_{x, x} = \frac{\displaystyle \sum_{i=1}^n(x_i - \color{orange}{\bar{x}}) (x_i - \color{orange}{\bar{x}})}{(\color{orange}{n-1})}$$`
]

---

```r
tbl <- subset(iris, select=-Species) %T>% str()
```

```
## 'data.frame':	150 obs. of  4 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
```

```r
cov(tbl)
```

```
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length        0.686      -0.042         1.27        0.52
## Sepal.Width        -0.042       0.190        -0.33       -0.12
## Petal.Length        1.274      -0.330         3.12        1.30
## Petal.Width         0.516      -0.122         1.30        0.58
```

???

- This provides a splendid example on covariance matrix

---

.column.bg-main1[.vmiddle.content[
- Covariance
- .amber[Pearson's] `$r$`
- Spearman's `$\rho$`
- Kendall's `$\tau$`
]]

---

# Pearson's `$r$`

---

.font2[
- Moment product correlation
- Describes the trend
- Also the .amber[magnitude]
- Dimension free
]

---

.font2[
`\begin{align}
r &= \frac{\color{orange}{s_{x,y}}}{s_x \cdot s_y} \\
  &= \displaystyle \sum_{i=1}^n \frac{\color{orange}{(x-\bar{x}) (y-\bar{y})}} {\color{orange}{(n-1)} \cdot s_x \cdot s_y} \\
  &= \displaystyle \sum_{i=1}^n \frac{\big( \frac{x-\bar{x}}{s_x} \big) \cdot \big( \frac{y-\bar{y}}{s_y} \big)}{n-1}
\end{align}`
]

???

- Concept recall: Z-score

---

.font2[
- It describes the relationship between .amber[two] numeric variables
- Both variables needs to follow a normal distribution
- Recall: `$Z \sim N(0, 1)$`
- Since `$r \sim Z \to r$` does not care for the unit!
]

---

- `$t \sim T(\nu)$`
- There exists another method of determining the significance
]

---

# Assumptions

???

- Important concept: joint distribution
- When the data follows a bivariate normal distribution, Pearson's `$r$` can
  completely describe the relationship
- However, bivariate normality is not a stringent assumption per se
- Could not address non-linearity

## Hypotheses

.font2[
- `$H_0$`: Both variables do not have a linear relationship
- `$H_1$`: Both variables have a linear relationship
]

---

# Example, please?

---

```r
lapply(tbl, shapiro.test) %>% lapply(broom::tidy) %>% lapply(data.frame) %>%
	{do.call(rbind, .)} %>% kable() %>% kable_minimal()
```

<table class=" lightable-minimal" style='font-family: "Trebuchet MS", verdana, sans-serif; margin-left: auto; margin-right: auto;'>
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
   <th style="text-align:left;"> method </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Sepal.Length </td>
   <td style="text-align:right;"> 0.98 </td>
   <td style="text-align:right;"> 0.01 </td>
   <td style="text-align:left;"> Shapiro-Wilk normality test </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Sepal.Width </td>
   <td style="text-align:right;"> 0.98 </td>
   <td style="text-align:right;"> 0.10 </td>
   <td style="text-align:left;"> Shapiro-Wilk normality test </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Petal.Length </td>
   <td style="text-align:right;"> 0.88 </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:left;"> Shapiro-Wilk normality test </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Petal.Width </td>
   <td style="text-align:right;"> 0.90 </td>
   <td style="text-align:right;"> 0.00 </td>
   <td style="text-align:left;"> Shapiro-Wilk normality test </td>
  </tr>
</tbody>
</table>

---

???

- Sepal width follows a normal distribution
- Sepal length *closely* follow a normal distribution
- Not many normality violations in sepal length (checked using qqplot)
- We shall see whether our data follow a bivariate normal distribution

---

```r
subset(tbl, select=c(Sepal.Length, Sepal.Width)) %>%
	MVN::mvn() # Multivariate normality
```

```
## $multivariateNormality
##              Test          Statistic            p value Result
## 1 Mardia Skewness   9.46144098216623 0.0505456076692465    YES
## 2 Mardia Kurtosis -0.853178029438543  0.393560585232763    YES
## 3             MVN               <NA>               <NA>    YES
## 
## $univariateNormality
##           Test     Variable Statistic   p value Normality
## 1 Shapiro-Wilk Sepal.Length      0.98      0.01    NO    
## 2 Shapiro-Wilk Sepal.Width       0.98      0.10    YES   
## 
## $Descriptives
##                n Mean Std.Dev Median Min Max 25th 75th Skew Kurtosis
## Sepal.Length 150  5.8    0.83    5.8 4.3 7.9  5.1  6.4 0.31    -0.61
## Sepal.Width  150  3.1    0.44    3.0 2.0 4.4  2.8  3.3 0.31     0.14
```

???

- Multivariate normality test is a general form of measuring bivariate
  normality
- We use Mardia's test for this purpose

---

```r
cor.test(tbl$Sepal.Length, tbl$Sepal.Width)
```

```
## 
## 	Pearson's product-moment correlation
## 
## data:  tbl$Sepal.Length and tbl$Sepal.Width
## t = -1, df = 148, p-value = 0.2
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.273  0.044
## sample estimates:
##   cor 
## -0.12
```

.font2[
- `$\color{red}{-1} \leq r \leq \color{orange}{1}$`
- .red[Negative] and .orange[positive] trends
]

---

---

.column.bg-main1[.vmiddle.content[
- Covariance
- Pearson's `$r$`
- .amber[Spearman's] `$\rho$`
- Kendall's `$\tau$`
]]

---

# Spearman's `$\rho$`

---

.font2[
- A non-parametric variant of Pearson's `$r$`
- Suitable to handle ordinal data
- In some cases: applicable for non-normally distributed numeric data
- Not sufficient to correctly handle tied values
]

---

.font2[
`\begin{align}
\rho &= 1 - \frac{6 \sum (R_x - R_y)^2}{n (n^2 - 1)} \\
\nu  &= n - 2 \tag{DoF}
\end{align}`
]

.font2[
- `$R_{x,y}$` is the rank for `$X, Y$`
- Ranking follows an order within one variable, i.e. .amber[not] by pooling the data
- By assigning rank, we can address non-linearity to a certain degree
]

???

- As an alternative to this equation, we can use Pearson's `$r$`
- But we need to use the rank instead of the actual data element

---

- `$t \sim T(\nu)$`
- Handle ties by taking the average value of ranks
- Tie `$\to$` Has little confidence in determining the p-value
]

---

# Assumptions

---

# Example, please?

---

## .red[Disclaimer!]

.font2[
- This example is only for an illustrative purpose
- We will re-use a subset on the `iris` dataset
]

---

```r
tbl <- subset(iris, select=-Species) %T>% str()
```

```r
cov(tbl)
```

---

```r
cor.test(tbl$Sepal.Length, tbl$Sepal.Width, method="spearman")
```

```
## 
## 	Spearman's rank correlation rho
## 
## data:  tbl$Sepal.Length and tbl$Sepal.Width
## S = 7e+05, p-value = 0.04
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##   rho 
## -0.17
```

.font2[
- `$\color{red}{-1} \leq \rho \leq \color{orange}{1}$`
- .red[Negative] and .orange[positive] trends
]

---

.column.bg-main1[.vmiddle.content[
- Covariance
- Pearson's `$r$`
- Spearman's `$\rho$`
- .amber[Kendall's] `$\tau$`
]]

---

# Kendall's `$\tau$`

---

???

- `$\tau_a$`: Square table
- `$\tau_b$`: Square table, handles tie
- `$\tau_c$`: Rectangular table, handles tie
- Most applicable on an ordinal data

---

.font2[
- For `$i, j \in X, Y: i \neq j,\ \exists\ (x_{i, j}, y_{i, j})$`
- Concordant: `$(x_i < x_j \ \texttt{and}\  y_i < y_j) \lor (x_i > x_j \ \texttt{and}\  y_i > y_j)$`
- Discordant: `$(x_i < x_j \ \texttt{and}\  y_i \nless y_j) \lor (x_i > x_j \ \texttt{and}\  y_i \ngtr y_j)$`
]

???

- Concordant: pairs with similar symbols
- Discordant: pairs with dissimilar symbols

---

.font2[
`\begin{align}
\tau_a &= \frac{n_c - n_d}{n}\\
\tau_b &= \frac{n_c - n_d}{\sqrt{(n + X_0) (n + Y_0)}} \\
\tau_c &= \frac{2(n_c - n_d)}{n^2 \frac{(m-1)}{m}} \\
n      &= \binom{n}{2}
\end{align}`
]

???

- Square table: both variables are ordinal with the same scale
- Rectangular table: both variables have different measurement scales
- `$n_c$`: Number of concordant pairs
- `$n_d$`: Number of discordant pairs
- `$n$`: Total number of possible pairs
- `$m$`: `$min(r, c): r$` is the row and `$c$` is the column
- `$X_0, Y_0$`: Ties in either X or Y
- Most statistical software employs Kendall's `$\tau_b$`

---

# Example, please?

---

```r
tbl <- subset(iris, select=-Species) %T>% str()
```

```r
cov(tbl)
```

---

```r
cor.test(tbl$Sepal.Length, tbl$Sepal.Width, method="kendall")
```

```
## 
## 	Kendall's rank correlation tau
## 
## data:  tbl$Sepal.Length and tbl$Sepal.Width
## z = -1, p-value = 0.2
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##    tau 
## -0.077
```

.font2[
- `$\color{orange}{0} \leq \tau \leq \color{orange}{1}$`
- Interpret the absolute value of `$\tau$`
- Base `R` only implements `$\tau_a$`, other methods exist in a specific packages
]

---

# Recap

.font2[
- Check normality
- Check linearity
- Non-parametric test: determine the presence of tie
- Perform correlation
- Create the plot (if necessary)
]

---

# Caveats

.font2[
- We only discussed .orange[*some*] of the popular correlation test
- All discussed methods assume .orange[I.I.D]
- .orange[Paired data] is suitable for .orange[none] of discussed methods
- .orange[Time series] data requires a different approach
- Correlation `$\color{orange}{\neq}$` Causation
]

---

# Is that all?

???

Short answer: no.

.font2[
- Concordance correlation coefficient
- Intraclass correlation
- Partial correlation
- Zero-order correlation
- The list goes on...
]

---