
Correlation of Numeric Variables

Aly Lamuri
Indonesia Medical Education and Research Institute


Overview

  • Covariance
  • Pearson's r
  • Spearman's ρ
  • Kendall's τ

Covariance

  • Concept recall: variance
  • Describes a trend between two numeric variables
  • Does not define the magnitude
  • How does y behave if we know the value of x?

Covariance

$$\sigma_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \mu_x)(y_i - \mu_y)}{n}$$

$$s_{x,y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

  • Concept recall: Bias and Bessel's correction
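The difference between the two denominators can be checked numerically. The following is a minimal sketch, assuming two columns of the built-in iris data and treating the sample mean as the population mean in the first formula; the variable names are illustrative only.

```r
# Sketch: population-style vs. sample covariance on two iris columns
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)

# Sum of cross-products of deviations from the mean
cross_products <- sum((x - mean(x)) * (y - mean(y)))

cov_population <- cross_products / n        # denominator n
cov_sample     <- cross_products / (n - 1)  # Bessel's correction

# Base R's cov() uses the n - 1 denominator
all.equal(cov_sample, cov(x, y))
# The two versions differ only by the factor (n - 1) / n
all.equal(cov_population, cov_sample * (n - 1) / n)
```

With 150 observations the two values are nearly identical; the correction matters most for small samples.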

Covariance matrix

  • Pairwise relationships between multiple numeric variables
  • Assessing trends at a glance
  • A useful descriptive statistic before designing a complex model

Example, please?

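library(magrittr) # Provides %>%, %T>% and divide_by()
tbl <- subset(iris, select=c(Sepal.Width, Sepal.Length)) %T>% str() # %T>% keeps the data frame; %>% would assign NULL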
## 'data.frame': 150 obs. of 2 variables:
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
  • We will calculate how Sepal.Width covaries with Sepal.Length
  • From here onwards, we will set x to represent the length
  • ...and y to represent the width

Example, please?

covariance <- function(x, y) {
  n <- length(x) # x and y must have the same length
  # Sum the cross-products of deviations, then divide by n - 1
  {(x - mean(x)) * (y - mean(y))} %>% sum() %>% divide_by(n - 1)
}
  • This function will help us calculate the covariance
  • Notice how it forms a computational sequence?
covariance(tbl$x, tbl$y)
## [1] -0.042
cov(tbl$x, tbl$y) # Built-in function
## [1] -0.042

Example, please?

What if we calculate the covariance of a variable with itself?

covariance(tbl$x, tbl$x)
## [1] 0.69
var(tbl$x) # Variance of x
## [1] 0.69

$$s_{x,x} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})}{n - 1}$$

  • The covariance of a variable with itself is its variance

Example, please?

tbl <- subset(iris, select=-Species) %T>% str()
## 'data.frame': 150 obs. of 4 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
cov(tbl)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length        0.686      -0.042         1.27        0.52
## Sepal.Width        -0.042       0.190        -0.33       -0.12
## Petal.Length        1.274      -0.330         3.12        1.30
## Petal.Width         0.516      -0.122         1.30        0.58
  • This provides a splendid example of a covariance matrix
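Two properties of the covariance matrix are worth verifying by hand. The sketch below assumes the same numeric subset of iris; the name `S` is only an illustrative label for the matrix.

```r
# Sketch: structural properties of a covariance matrix
tbl <- subset(iris, select = -Species)
S <- cov(tbl)

# The diagonal holds the variance of each variable
all.equal(unname(diag(S)), unname(sapply(tbl, var)))

# The matrix is symmetric because cov(x, y) equals cov(y, x)
isSymmetric(S)
```

Because of the symmetry, only the upper (or lower) triangle carries distinct pairwise information.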

Pearson's r

  • Product-moment correlation
  • Describes the trend
  • Also the magnitude
  • Dimension free

Pearson's r

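$$r = \frac{s_{x,y}}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y} = \frac{\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$

  • Concept recall: Z-score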

Pearson's r

$$r = \frac{\sum Z_x Z_y}{n - 1}, \qquad \text{(DoF)} \; \nu = n - 2$$

  • It describes the relationship between two numeric variables
  • Both variables need to follow a normal distribution
  • Recall: $Z \sim N(0, 1)$
  • Since r is computed from Z-scores, it does not depend on the unit of measurement!
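The equivalence between the covariance form and the Z-score form can be checked directly. This is a minimal sketch on two iris columns; the variable names and the cm-to-mm rescaling are illustrative assumptions, not part of the slides.

```r
# Sketch: Pearson's r computed three ways
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)

# (1) Covariance scaled by both standard deviations
r_cov <- cov(x, y) / (sd(x) * sd(y))

# (2) Sum of the products of Z-scores, divided by n - 1
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)
r_z <- sum(z_x * z_y) / (n - 1)

# (3) Built-in function
r_builtin <- cor(x, y)

# Rescaling both variables (e.g. cm to mm) leaves r unchanged
r_mm <- cor(x * 10, y * 10)

all.equal(r_cov, r_builtin)
all.equal(r_z, r_builtin)
all.equal(r_mm, r_builtin)
```

Covariance, by contrast, would be multiplied by 100 under the same rescaling, which is why it describes the trend but not the magnitude.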

Pearson's r

$$t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$$

  • $t \sim T(\nu)$
  • There exists another method of determining the significance
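The test statistic can be reproduced by hand and compared with `cor.test()`. A minimal sketch, assuming a two-sided test on two iris columns:

```r
# Sketch: t statistic and two-sided p-value for Pearson's r
x <- iris$Sepal.Length
y <- iris$Sepal.Width
n <- length(x)

r  <- cor(x, y)
nu <- n - 2                                # Degrees of freedom
t_stat  <- r / sqrt((1 - r^2) / nu)
p_value <- 2 * pt(-abs(t_stat), df = nu)   # Two-sided p-value

# Compare with the built-in test
res <- cor.test(x, y)
all.equal(t_stat, unname(res$statistic))
all.equal(p_value, res$p.value)
```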

Assumptions

  • I.I.D
  • Univariate normality
  • Bivariate normality
  • Has a linear relationship

Hypotheses

  • H0: The two variables do not have a linear relationship
  • H1: The two variables have a linear relationship
  • Important concept: joint distribution
  • When the data follow a bivariate normal distribution, Pearson's r can completely describe the relationship
  • However, bivariate normality is not a stringent assumption per se
  • Pearson's r cannot capture a non-linear relationship

Example, please?

lapply(tbl, shapiro.test) %>% lapply(broom::tidy) %>% lapply(data.frame) %>%
{do.call(rbind, .)} %>% kable() %>% kable_minimal()
               statistic  p.value  method
Sepal.Length        0.98     0.01  Shapiro-Wilk normality test
Sepal.Width         0.98     0.10  Shapiro-Wilk normality test
Petal.Length        0.88     0.00  Shapiro-Wilk normality test
Petal.Width         0.90     0.00  Shapiro-Wilk normality test

Example, please?

  • Sepal width follows a normal distribution
  • Sepal length closely follows a normal distribution
  • Sepal length shows few departures from normality (checked using a Q-Q plot)
  • We shall see whether our data follow a bivariate normal distribution

Example, please?

subset(tbl, select=c(Sepal.Length, Sepal.Width)) %>%
MVN::mvn() # Multivariate normality
## $multivariateNormality
##              Test          Statistic            p value Result
## 1 Mardia Skewness   9.46144098216623 0.0505456076692465    YES
## 2 Mardia Kurtosis -0.853178029438543  0.393560585232763    YES
## 3             MVN               <NA>               <NA>    YES
##
## $univariateNormality
##           Test     Variable Statistic p value Normality
## 1 Shapiro-Wilk Sepal.Length      0.98    0.01        NO
## 2 Shapiro-Wilk  Sepal.Width      0.98    0.10       YES
##
## $Descriptives
##                n Mean Std.Dev Median Min Max 25th 75th Skew Kurtosis
## Sepal.Length 150  5.8    0.83    5.8 4.3 7.9  5.1  6.4 0.31    -0.61
## Sepal.Width  150  3.1    0.44    3.0 2.0 4.4  2.8  3.3 0.31     0.14
  • A multivariate normality test generalizes the test of bivariate normality
  • We use Mardia's test for this purpose

Example, please?

cor.test(tbl$Sepal.Length, tbl$Sepal.Width)
##
## Pearson's product-moment correlation
##
## data: tbl$Sepal.Length and tbl$Sepal.Width
## t = -1, df = 148, p-value = 0.2
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.273 0.044
## sample estimates:
## cor
## -0.12
  • $-1 \le r \le 1$
  • Negative and positive trends

Spearman's ρ

  • A non-parametric variant of Pearson's r
  • Suitable to handle ordinal data
  • In some cases: applicable for non-normally distributed numeric data
  • Does not handle tied values well

Spearman's ρ

$$\rho = 1 - \frac{6 \sum (R_x - R_y)^2}{n(n^2 - 1)}, \qquad \text{(DoF)} \; \nu = n - 2$$

  • $R_x$ and $R_y$ are the ranks of $X$ and $Y$
  • Ranking follows an order within one variable, i.e. not by pooling the data
  • By assigning ranks, we can address non-linearity to a certain degree
  • As an alternative to this equation, we can use Pearson's r
  • But we need to use the ranks instead of the actual data elements
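Both routes to Spearman's ρ can be compared in a few lines. This is a minimal sketch: the iris columns illustrate the rank-based route, and the short vectors `a` and `b` are made-up values without ties, used because the closed-form expression is exact only when no ties occur.

```r
# Sketch: Spearman's rho as Pearson's r on ranks
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Ranks are assigned within each variable, not by pooling the data
rho_ranks   <- cor(rank(x), rank(y))
rho_builtin <- cor(x, y, method = "spearman")
all.equal(rho_ranks, rho_builtin)

# Closed-form expression on hypothetical data without ties
a <- c(3, 1, 4, 1.5, 9, 2.6)
b <- c(2, 7, 1, 8, 2.8, 1.8)
n <- length(a)
rho_formula <- 1 - 6 * sum((rank(a) - rank(b))^2) / (n * (n^2 - 1))
all.equal(rho_formula, cor(a, b, method = "spearman"))
```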

Spearman's ρ

$$t = \frac{\rho}{\sqrt{\frac{1 - \rho^2}{n - 2}}}$$

  • $t \sim T(\nu)$
  • Handle ties by taking the average value of ranks
  • Ties reduce our confidence in the computed p-value
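Average ranking of ties is the default behaviour of `rank()` in base R. The sketch below uses made-up vectors; passing `exact = FALSE` to `cor.test()` is one way to request the approximate p-value explicitly when ties are present.

```r
# Sketch: tied values share the average of the ranks they occupy
scores <- c(10, 20, 20, 30)
tie_ranks <- rank(scores)  # 1.0 2.5 2.5 4.0

# Hypothetical small sample containing ties in both variables
x <- c(1, 2, 2, 4, 5)
y <- c(2, 1, 3, 3, 5)
res <- cor.test(x, y, method = "spearman", exact = FALSE)
res$estimate
res$p.value
```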

Assumptions

  • I.I.D
  • Monotonic trend
  • Has a natural order

Example, please?

Disclaimer!

  • This example is only for an illustrative purpose
  • We will re-use a subset of the iris dataset

Example, please?

cor.test(tbl$Sepal.Length, tbl$Sepal.Width, method="spearman")
##
## Spearman's rank correlation rho
##
## data: tbl$Sepal.Length and tbl$Sepal.Width
## S = 7e+05, p-value = 0.04
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.17
  • $-1 \le \rho \le 1$
  • Negative and positive trends

Kendall's τ

  • Non-parametric
  • Methods: $\tau_a, \tau_b, \tau_c$
  • Concordant and discordant pairs
  • $\tau_a$: Square table
  • $\tau_b$: Square table, handles ties
  • $\tau_c$: Rectangular table, handles ties
  • Most applicable to ordinal data

Kendall's τ

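  • For $i, j \in X, Y : i \neq j$, compare the pairs $(x_i, y_i)$ and $(x_j, y_j)$
  • Concordant: $(x_i < x_j \text{ and } y_i < y_j) \lor (x_i > x_j \text{ and } y_i > y_j)$
  • Discordant: $(x_i < x_j \text{ and } y_i > y_j) \lor (x_i > x_j \text{ and } y_i < y_j)$
  • Concordant: both inequalities point in the same direction
  • Discordant: the inequalities point in opposite directions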

Kendall's τ

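$$\tau_a = \frac{n_c - n_d}{n_0}$$

$$\tau_b = \frac{n_c - n_d}{\sqrt{(n_c + n_d + X_0)(n_c + n_d + Y_0)}}$$

$$\tau_c = \frac{2 (n_c - n_d)}{n^2 \frac{(m - 1)}{m}}$$

$$n_0 = \binom{n}{2}$$

  • Square table: both variables are ordinal with the same scale
  • Rectangular table: both variables have different measurement scales
  • $n_c$: Number of concordant pairs
  • $n_d$: Number of discordant pairs
  • $n$: Number of observations
  • $n_0$: Total number of possible pairs
  • $m = \min(r, c)$, where $r$ is the number of rows and $c$ is the number of columns
  • $X_0, Y_0$: Number of pairs tied only in X or only in Y
  • Most statistical software employs Kendall's $\tau_b$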
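Counting the pairs explicitly makes the definitions concrete. The following is a minimal sketch, not an optimized implementation: `kendall_counts()` is a hypothetical helper written for this illustration, and τb is computed with the common textbook form whose denominator is the square root of (n_c + n_d + X_0)(n_c + n_d + Y_0), where X_0 and Y_0 count pairs tied only in x or only in y.

```r
# Sketch: count concordant, discordant and tied pairs by brute force
kendall_counts <- function(x, y) {
  idx <- combn(length(x), 2)              # All unordered pairs (i, j)
  dx <- sign(x[idx[1, ]] - x[idx[2, ]])
  dy <- sign(y[idx[1, ]] - y[idx[2, ]])
  list(
    n_c = as.numeric(sum(dx * dy > 0)),         # Concordant pairs
    n_d = as.numeric(sum(dx * dy < 0)),         # Discordant pairs
    x_0 = as.numeric(sum(dx == 0 & dy != 0)),   # Tied only in x
    y_0 = as.numeric(sum(dx != 0 & dy == 0)),   # Tied only in y
    n_0 = as.numeric(ncol(idx))                 # choose(n, 2) possible pairs
  )
}

x <- iris$Sepal.Length
y <- iris$Sepal.Width
k <- kendall_counts(x, y)

tau_a <- (k$n_c - k$n_d) / k$n_0
tau_b <- (k$n_c - k$n_d) /
  sqrt((k$n_c + k$n_d + k$x_0) * (k$n_c + k$n_d + k$y_0))
```

The pairwise loop grows quadratically with the sample size, so this approach is only meant for small data sets such as iris.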

Example, please?

cor.test(tbl$Sepal.Length, tbl$Sepal.Width, method="kendall")
##
## Kendall's rank correlation tau
##
## data: tbl$Sepal.Length and tbl$Sepal.Width
## z = -1, p-value = 0.2
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.077
  • $-1 \le \tau \le 1$
  • Interpret the strength of the association from the absolute value of τ
  • With tied data, base R's cor() and cor.test() report τb; other variants exist in specific packages

Recap

  • Check normality
  • Check linearity
  • Non-parametric test: determine the presence of ties
  • Perform correlation
  • Create the plot (if necessary)
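The recap can be sketched as a short script. The decision rule below (Pearson when both Shapiro-Wilk tests pass at 0.05, otherwise Kendall when ties are present and Spearman when not) is an illustrative assumption rather than a rule prescribed by these slides, and the linearity check is left as a visual step.

```r
# Sketch: one possible workflow following the recap
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# 1. Check normality of each variable
normal_x <- shapiro.test(x)$p.value > 0.05
normal_y <- shapiro.test(y)$p.value > 0.05

# 2. Check linearity visually
# plot(x, y)  # Uncomment to draw a scatter plot

# 3. For rank-based methods, determine the presence of ties
has_ties <- anyDuplicated(x) > 0 || anyDuplicated(y) > 0

# 4. Choose a method (assumed rule, see the lead-in) and perform the test
method <- if (normal_x && normal_y) {
  "pearson"
} else if (has_ties) {
  "kendall"
} else {
  "spearman"
}
res <- cor.test(x, y, method = method, exact = FALSE)
res$estimate
```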

Caveats

  • We only discussed some of the popular correlation tests
  • All discussed methods assume I.I.D
  • None of the discussed methods is suitable for paired data
  • Time series data requires a different approach
  • Correlation ≠ Causation

Is that all?

  • Concordance correlation coefficient
  • Intraclass correlation
  • Partial correlation
  • Zero-order correlation
  • The list goes on...

Short answer: no.

Query?
