
Mathematics for Machine Learning

Aly Lamuri
Department of Medical Physics


Overview

  • Linear algebra
  • Calculus
  • Probability and statistics

Linear algebra

  • A set of rules for manipulating mathematical objects
  • Operates in defined spaces
  • Defines relationships between multiple variables

Which objects mathematics deals with depends on the branch; in linear algebra:

  • Scalar
  • Vector
  • Matrix
  • Tensor

Spaces:

  • Vector
  • Metric
  • Normed
  • Inner product


What does it look like?

αx + βy = c



Geometry and algebra?

  • Euclidean space (flat)
  • Non-Euclidean space (added curvature)
  • We often assume a flat Euclidean space
  • Violating any of Euclid's postulates results in a non-Euclidean space
  • Euclidean approximations do not generalize well to non-Euclidean spaces

Euclid's five postulates

  • A straight line can be drawn joining any two points
  • A straight line segment can be extended indefinitely
  • Rotating a straight line traces a circle
  • All right angles are equal to one another

When a straight line is rotated about one endpoint, it forms a circle whose radius is the length of that line

Where's the fifth one?

The fifth postulate

If two lines are drawn which intersect a third in such a way that the sum of the inner angles on one side is less than two right angles, then the two lines inevitably must intersect each other on that side if extended far enough

Also known as the parallel postulate
  • Sounds reasonable
  • But only on a flat surface
  • What if we embed such a surface into curvatures?

What do we use Euclidean space for?

  • Modelling our real world :)
  • Making an approximation we can comprehend
  • Estimating the distance between two points
Example in machine learning, please

  • Hierarchical clustering
  • Principal component analysis
  • And many more to jump on the bandwagon!

Hierarchical clustering

  • In hierarchical clustering, we first compute the distance between data points to quantify their closeness
  • Closer distance = higher similarity
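As a minimal sketch of that first step (with hypothetical 2-D points), agglomerative clustering computes all pairwise distances and merges the closest pair:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 2-D data points: the first two are close, the third is far away
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0)]

# Pairwise distance "matrix" as a dict keyed by index pairs
dist = {(i, j): euclidean(points[i], points[j])
        for i in range(len(points)) for j in range(i + 1, len(points))}

# The pair with the smallest distance is merged first
closest = min(dist, key=dist.get)
```

Here `closest` is `(0, 1)`: the two nearby points are the most similar and would form the first cluster.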

PCA

  • PCA maximizes the variance retained in the reduced dimension (e.g. going from two dimensions to one, it finds the one-dimensional projection whose variance is largest)
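A sketch of that idea on hypothetical, strongly correlated 2-D data: the largest eigenvalue of the 2×2 covariance matrix (closed form below) equals the variance captured by the first principal component, and it exceeds the variance along either original axis:

```python
import math

# Hypothetical, strongly correlated 2-D data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
syy = sum((y - my) ** 2 for y in ys) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Largest eigenvalue of a symmetric 2x2 matrix, in closed form:
# the variance of the data projected onto the first principal component
lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
explained = lam / (sxx + syy)   # fraction of total variance retained
```

For this data the first component retains well over 95% of the total variance, which is why dropping the second dimension loses little information.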

A violation of Euclid's postulate?

  • Let X be a curved surface, an embedding of a two-dimensional Euclidean space
  • None of the straight lines will follow Euclid's postulates
  • Hence: a non-Euclidean space

Okay, geometry is cool and all

But, what does it have to do with machine learning?

Caveats in (some) machine learning models

  • We assume our data lives in a flat mathematical space
  • High dimensionality does not play well with Euclid's postulates
  • Some address this limitation by developing non-Euclidean algorithms
  • This course will mostly consider Euclidean spaces
  • Curse of dimensionality: phenomena that occur only in high-dimensional data
  • More dimensions require more data

Back to linear algebra

  • We have seen how algebra relates to geometry
  • We know some limitations when performing algebraic operations
  • Now we can safely proceed to understand the role of algebra in machine learning!


Quick recall: spaces in linear algebra

  • Vector
  • Metric
  • Normed
  • Inner product

Vector space

  • Governs multiplication and addition
  • Produces objects of the same classes
  • What are classes?
  • Defines a system of linear equations

Classes:

  • Geometric vector: what we learnt in high school
  • Polynomial
  • Signals
  • Tuple: what we mainly concern ourselves with in machine learning

System of linear equations

Let X = (x_1, x_2, …, x_n) and A = (α_1, α_2, …, α_m), with A, X ∈ ℝ; then

f(x) = Σ_{j=1}^{m} Σ_{i=1}^{n} α_{ij} x_i
  • Let there be a sequence of vectors X and scalars A; there exists a function defining the sum of products between x and α

Example, please?

4x₁ + 4x₂ = 5
2x₁ − 4x₂ = 1   (+)
⇒ 6x₁ = 6

x₁ = 1
x₂ = 0.25

What do we learn?

  • A system of linear equations is not complicated
  • It is intuitive to solve
  • At a glance, we can see a pattern in their solutions
  • But a computer does not rely on intuition
  • How do we teach a computer to solve them?


Solution: use matrices


Re-formulate our solution as a matrix

[4 4; 2 −4][x₁; x₂] = [5; 1]

Let A := [4 4; 2 −4]

A[x₁; x₂] = [5; 1]
A⁻¹A[x₁; x₂] = A⁻¹[5; 1]

Calculate an inverse matrix A1

Recall that AA⁻¹ = I. Start by placing the original matrix and the identity matrix side by side:

[4 4 | 1 0; 2 −4 | 0 1]

Calculate an inverse matrix A1

Add the second row to the first:

[6 0 | 1 1; 2 −4 | 0 1]

Calculate an inverse matrix A1

Divide the first row by 6:

[1 0 | 1/6 1/6; 2 −4 | 0 1]

Calculate an inverse matrix A1

Divide the second row by 2:

[1 0 | 1/6 1/6; 1 −2 | 0 1/2]

Calculate an inverse matrix A1

Subtract the first row from the second:

[1 0 | 1/6 1/6; 0 −2 | −1/6 1/3]

Calculate an inverse matrix A1

Divide the second row by −2:

[1 0 | 1/6 1/6; 0 1 | 1/12 −1/6]

Now you have I on the left and A⁻¹ on the right :)

Calculate an inverse matrix A1

  • We call this method elementary row operations
  • Basically, you can multiply any row by any scalar
  • But you can only add or subtract one row with another
  • Swapping rows is perfectly fine though

This is one of the simplest methods to obtain A⁻¹ for m×m matrices

Solving the equation

A⁻¹A[x₁; x₂] = A⁻¹[5; 1]
[x₁; x₂] = [1/6 1/6; 1/12 −1/6][5; 1]
[x₁; x₂] = [1; 1/4]
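The row operations above can be sketched as a small Gauss-Jordan routine (exact arithmetic via fractions; the matrix and right-hand side are the ones from the slides):

```python
from fractions import Fraction as F

# Augmented matrix [A | I] for A = [4 4; 2 -4]
aug = [[F(4), F(4), F(1), F(0)],
       [F(2), F(-4), F(0), F(1)]]
n = 2

for col in range(n):
    # Scale the pivot row so the pivot becomes 1
    piv = aug[col][col]
    aug[col] = [v / piv for v in aug[col]]
    # Subtract multiples of the pivot row from the other rows
    for row in range(n):
        if row != col:
            factor = aug[row][col]
            aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]

inv = [row[n:] for row in aug]          # right half is now A^-1
b = [F(5), F(1)]
x = [sum(inv[i][j] * b[j] for j in range(n)) for i in range(n)]
```

The exact sequence of pivots differs from the manual steps on the slides, but the result is the same A⁻¹ and the same solution x₁ = 1, x₂ = 1/4.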

Why use matrices?

  • Computers understand matrices as a mathematical construct
  • They can easily perform mathematical operations
  • A matrix generalizes to higher-dimensional objects, i.e. tensors
  • Dot product → tensor product


Hardware-wise

  • CPU: General arithmetic operation
  • GPU: Matrix manipulation
  • TPU: Tensor manipulation

Are all matrices invertible?

  • No. Sorry to disappoint :(
  • A non-invertible matrix is called singular; an invertible one is regular / non-singular
  • Only invertible matrices provide unique solutions

Metric space

  • A space where we consider the distance between two data points
  • Comes as a pair of objects: a set and a distance function on it
  • Some specific properties?

d : X × X → [0, ∞)
  • X is a set of real numbers

Metric space

Let (x, y) ∈ X; then

d(x, y) = 0 ⇔ x = y
d(x, y) > 0 for x ≠ y
d(x, y) = d(y, x)
d(x, y) ≤ d(x, z) + d(z, y)

  • Let the pair (x, y) be elements of the set X
  • The distance function d may vary

Distance function

  • Different functions exist!
  • We need a measure to define how dissimilar our data points are
  • Euclidean
  • Manhattan
  • Minkowski


Euclidean distance

D = √( Σ_{i=1}^{n} (p_i − q_i)² )

  • Measures the shortest distance in a Euclidean space
  • The most commonly used distance measure

n is the number of dimensions


Manhattan distance

D = Σ_{i=1}^{n} |p_i − q_i|

  • An absolute distance in a Euclidean space
  • Often preferred for higher-dimensional data


Minkowski distance

D = ( Σ_{i=1}^{n} |p_i − q_i|^p )^{1/p}

  • Useful in non-Euclidean spaces
  • Not a part of the metric space, but still nice to know
  • p: Order of the norm
  • A more general form of the former two distance measures
  • When p = 1, it behaves as the Manhattan distance
  • When p = 2, it behaves as the Euclidean distance
  • Fun fact: Einstein's special relativity used Minkowski distance
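A minimal sketch of the three measures, confirming that Minkowski reduces to Manhattan at p = 1 and to Euclidean at p = 2 (the points a and b are hypothetical):

```python
import math

def euclidean(u, v):
    """Shortest straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate-wise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def minkowski(u, v, p):
    """Order-p generalization of the two measures above."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

a, b = (1.0, 2.0), (4.0, 6.0)   # a 3-4-5 right triangle apart
```

Here `euclidean(a, b)` is 5 and `manhattan(a, b)` is 7, matching `minkowski(a, b, 2)` and `minkowski(a, b, 1)` respectively.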

Comparison of distance measures

Normed linear space

  • A non-negative map on the vector space
  • A formalization of the length of x
  • The norm used is a real-valued function

Example: Hilbert space, Lp space


Inner product space

  • Basically a vector space equipped with an inner product
  • Maps two vectors to a scalar
  • Inner product: a generalization of the dot product


Imagine the inner product as a dot product without the dimensionality constraint

Overview

  • Linear algebra
  • Calculus
  • Probability and statistics

Summary:

  • Linear algebra: a set of mathematical operations over specific spaces
  • Vector spaces: we deal a lot with matrices and tensors
  • Application: a direct implementation is gradient descent
  • Algebra deals with relations

Calculus

  • Optimizing machine learning algorithms
  • Minimizing the cost function
  • Finding extrema: minima or maxima?
  • Determining the weights in our model
  • Cost function: akin to ε when performing regression
  • Lower ε → lower randomness

Gradient

A generalization of partial derivatives, used to find an extremum

∇f = [ ∂f/∂x_1, …, ∂f/∂x_n ]
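Numerically, the gradient can be sketched with central differences, one partial derivative per coordinate (f below is a hypothetical cost function with known analytic gradient [2x₁, 2x₂]):

```python
def grad(f, x, h=1e-6):
    """Approximate the gradient of f at x by central differences."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h   # nudge coordinate i up
        xm[i] -= h   # nudge coordinate i down
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

f = lambda v: v[0] ** 2 + v[1] ** 2   # a simple convex cost function
g = grad(f, [3.0, 4.0])               # analytic gradient here is [6, 8]
```

A gradient-descent step would then move in the direction −g to reduce the cost.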


Overview

  • Linear algebra
  • Calculus
  • Probability and statistics

Probability and statistics

P(A|B) = P(A ∩ B) / P(B)

  • Both frequentist and Bayesian statistics revolve around conditional probability
  • Only Bayesian statistics allows updating prior belief on probability function
  • But, what is statistics in essence?
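Conditional probability can be sketched with a single fair die (the events below are chosen purely for illustration):

```python
from fractions import Fraction

S = set(range(1, 7))   # sample space: one fair die roll
A = {2, 4, 6}          # event A: the outcome is even
B = {4, 5, 6}          # event B: the outcome is greater than 3

def p(event):
    """Probability as |event| / |sample space|."""
    return Fraction(len(event), len(S))

p_a_given_b = p(A & B) / p(B)   # P(A|B) = P(A ∩ B) / P(B)
```

Knowing the roll exceeded 3 raises the chance of an even outcome from 1/2 to 2/3.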

Population

All observable subjects inhabiting a certain location


Parameters

Quantitative summary of a population


Be more specific, please?
Notation and meanings

X: Data element
N: Number of elements
p: Proportion
M: Median
μ: Average
σ: Standard deviation
σ²: Variance
ρ: Correlation coefficient

Sample

A subset of an observable population


Statistics

Quantitative summary of a sample


Be more specific, please?
Notation and meanings

x: Data element
n: Number of elements
p̂: Proportion
m: Median
x̄: Average
s: Standard deviation
s²: Variance
r: Correlation coefficient

Probability

  • An event E occurring within a particular sample space S
  • Event: Expected results
  • Sample space: All possible outcomes
  • Probability P is the proportion of an event relative to its sample space
  • Or mathematically:

P(E = e) = |E| / |S|
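Counting an event against its sample space, sketched for two dice (a hypothetical example):

```python
from fractions import Fraction
from itertools import product

# Sample space S: all ordered outcomes of rolling two fair dice
S = list(product(range(1, 7), repeat=2))

# Event E: the two faces sum to 7
E = [outcome for outcome in S if sum(outcome) == 7]

prob = Fraction(len(E), len(S))   # |E| / |S| = 6/36
```

Six of the 36 equally likely outcomes sum to 7, so the probability is 1/6.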


Random Variables

Independent vs Identical? I.I.D

  • All sampled random variables should be independent from one another
  • Each sampling procedure has to be identical, so as to produce the same probability

Considering I.I.D, can we do a better probability estimation?

  • If they are I.I.D, we can approximate the probability using:
    • Probability Mass Function (discrete variables)
    • Probability Density Function (continuous variables)

In math, please?

(1) P(E = e) = f(e) ≥ 0 : e ∈ S
(2) Σ_{e ∈ S} f(e) = 1
(3) P(E ∈ A) = Σ_{e ∈ A} f(e) : A ⊆ S
  • The function f is arbitrary; it can take on any form
  • There are myriad distributions
  • We will look at specific examples

Probability Mass Function

  • Binomial: X ~ B(n, p)
  • Geometric: X ~ G(p)
  • Poisson: X ~ P(λ)
Similarity: all describe discrete random variables following Bernoulli trials

Bernoulli trials:

  • All instances are independent
  • All instances have identical probability
  • Example: A coin flip, sampling with replacement

Binomial distribution

Key question: How many events occur in n trials, each with probability p?

Geometric distribution

Key question: How many failures before getting an event?


Poisson distribution

Key question: What is the chance of having n events given a λ rate?

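Minimal sketches of the three mass functions (standard textbook formulas; the parameter values used below are arbitrary):

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): k successes in n Bernoulli trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """P(X = k): k failures before the first success."""
    return (1 - p) ** k * p

def poisson_pmf(k, lam):
    """P(X = k) for X ~ P(lam): k events given rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)
```

Summing `binomial_pmf(k, n, p)` over k = 0..n gives 1, matching axiom (2) of a PMF.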

Probability Density Function

  • Gamma: X ~ Γ(α, β)
  • Exponential: X ~ Exp(λ)
  • χ²: X ~ χ²(ν)
  • Normal: X ~ N(μ, σ)
Similarity: all describe continuous random variables


Gamma distribution

Exponential distribution

χ² distribution

Normal distribution

Central Limit Theorem

X̄ →_d N(μ, σ/√n) as n → ∞

  • So far, we have learnt sampling distributions
  • We are also able to compute the mean and standard deviation based on their parameters
  • It just so happens that the sample mean follows a normal distribution
  • The central limit theorem delineates this occurrence
  • This rule applies to both discrete and continuous distributions

→_d denotes convergence in distribution of random variables
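The theorem can be sketched empirically: means of repeated die-roll samples cluster around μ = 3.5 with spread σ/√n, even though a single roll is uniform rather than normal (the sample sizes are arbitrary and the RNG is seeded for reproducibility):

```python
import math
import random
import statistics

random.seed(42)

n, reps = 30, 2000   # n rolls per sample, reps samples
sample_means = [statistics.mean(random.randint(1, 6) for _ in range(n))
                for _ in range(reps)]

mu = 3.5                                # population mean of one die roll
sigma = statistics.pstdev(range(1, 7))  # population sd, sqrt(35/12) ≈ 1.71
se = sigma / math.sqrt(n)               # CLT-predicted spread of the sample mean
```

The empirical mean and standard deviation of `sample_means` land close to μ and σ/√n, which is exactly what the theorem predicts.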

Linear Model

y = β₀ + Σ_{i=1}^{n} βᵢxᵢ + ε

  • Estimating y as f(x)
  • Assumptions:
    • Linearity between y and xi at any index i
    • Homoskedasticity
    • Non-existent multicollinearity
    • Normality of residual ϵ
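For a single predictor, the least-squares coefficients have a closed form, sketched below on hypothetical data roughly following y = 1 + 2x:

```python
# Hypothetical data, roughly y = 1 + 2x plus noise
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
n = len(xs)

mx = sum(xs) / n
my = sum(ys) / n

# slope = cov(x, y) / var(x); intercept follows from the means
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
beta1 = sxy / sxx
beta0 = my - beta1 * mx

# Residuals ε: what the line fails to explain; OLS makes them sum to zero
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
```

The fitted slope and intercept (≈ 1.94 and ≈ 1.15 here) recover the underlying trend, and the residuals sum to zero by construction.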

Generalized Linear Model

g(y) = β₀ + Σ_{i=1}^{n} βᵢxᵢ + ε

  • Estimating the output of g(y) as f(x)
  • Assumptions:
    • Linearity between g(y) and f(x)
    • Non-existent multicollinearity
    • Homoskedasticity (more lenient compared to LM)

What is g(y)?

  • A link function
  • Basically, it transforms y into a continuous linear value
  • A GLM with the identity link function = LM
  • Depends on which distribution y comes from
  • Binomial distribution: logit and probit (also termed a logistic regression)
  • Poisson distribution: log (log-linear regression), identity, sqrt
  • Other distributions exist: inverse Gaussian, Cauchy, gamma, etc.

Regression in statistics

  • Depends on which distribution y is coming from
  • Ordinary Least Squares determines the value of each parameter
  • Maximum Likelihood Estimation determines the slope and intercept β₀
  • The variance-covariance matrix determines the standard error, used to calculate the p-value

Overview

  • Linear algebra
  • Calculus
  • Probability and statistics