
Mathematics for Machine Learning

Aly Lamuri
Department of Medical Physics


Overview

  • Linear algebra
  • Calculus
  • Probability and statistics

Linear algebra

  • A set of rules for manipulating mathematical objects
  • Operates in defined spaces
  • Defines relationships between multiple variables

Which objects mathematics deals with depends on the branch; in linear algebra:

  • Scalar
  • Vector
  • Matrix
  • Tensor

Spaces:

  • Vector
  • Metric
  • Normed
  • Inner product


What does it look like?

αx + βy = c



Geometry and algebra?

  • Euclidean space (flat)
  • Non-Euclidean space (added curvature)
  • We often assume a flat Euclidean space
  • Violating any of Euclid's postulates results in a non-Euclidean space
  • Euclidean approximations do not generalize well to non-Euclidean spaces

Euclid's five postulates

  • A straight line can be drawn joining any two points
  • A straight line segment can be extended indefinitely
  • Rotating a straight line traces a circle
  • All right angles are equal to one another

When a straight line is rotated about one endpoint, it forms a circle whose radius is the length of that line

Where's the fifth one?

The fifth postulate

If two lines are drawn which intersect a third in such a way that the sum of the inner angles on one side is less than two right angles, then the two lines inevitably must intersect each other on that side if extended far enough

Also known as the parallel postulate
  • Sounds reasonable
  • But only on a flat surface
  • What if we embed such a surface into curvatures?

What do we use Euclidean space for?

  • Modelling our real world :)
  • Making an approximation we can comprehend
  • Estimating the distance between two points
Example in machine learning, please

  • Hierarchical clustering
  • Principal component analysis
  • And many more to jump on the bandwagon!

Hierarchical clustering

  • In hierarchical clustering, we first compute the distance between data points to quantify their closeness
  • Closer distance = higher similarity
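As a minimal sketch of that first step (with hypothetical 2-D points), agglomerative clustering computes all pairwise distances and merges the closest pair:

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 2-D data points: the first two are close, the third is far away
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0)]

# Pairwise distance "matrix" as a dict keyed by index pairs
dist = {(i, j): euclidean(points[i], points[j])
        for i in range(len(points)) for j in range(i + 1, len(points))}

# The pair with the smallest distance is merged first
closest = min(dist, key=dist.get)
```

Here `closest` is `(0, 1)`: the two nearby points are the most similar and would form the first cluster.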

PCA

  • PCA maximizes the variance retained in the reduced dimension (e.g. going from two dimensions to one, it finds the one-dimensional projection whose variance is largest)
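A sketch of that idea on hypothetical, strongly correlated 2-D data: the largest eigenvalue of the 2×2 covariance matrix (closed form below) equals the variance captured by the first principal component, and it exceeds the variance along either original axis:

```python
import math

# Hypothetical, strongly correlated 2-D data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
syy = sum((y - my) ** 2 for y in ys) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

# Largest eigenvalue of a symmetric 2x2 matrix, in closed form:
# the variance of the data projected onto the first principal component
lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
explained = lam / (sxx + syy)   # fraction of total variance retained
```

For this data the first component retains well over 95% of the total variance, which is why dropping the second dimension loses little information.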

A violation of Euclid's postulate?

  • Let X be a curved surface, an embedding of a two-dimensional Euclidean space
  • None of the straight lines will follow Euclid's postulates
  • Hence: a non-Euclidean space

Okay, geometry is cool and all

But, what does it have to do with machine learning?

Caveats in (some) machine learning models

  • We assume our data lives in a flat mathematical space
  • High dimensionality does not play well with Euclid's postulates
  • Some address this limitation by developing non-Euclidean algorithms
  • This course will mostly consider Euclidean spaces
  • Curse of dimensionality: phenomena that occur only in high-dimensional data
  • More dimensions require more data

Back to linear algebra

  • We have seen how algebra relates to geometry
  • We know some limitations when performing algebraic operations
  • Now we can safely proceed to understand the role of algebra in machine learning!


Quick recall: spaces in linear algebra

  • Vector
  • Metric
  • Normed
  • Inner product

Vector space

  • Governs multiplication and addition
  • Produces objects of the same classes
  • What are classes?
  • Defines a system of linear equations

Classes:

  • Geometric vector: what we learnt in high school
  • Polynomial
  • Signals
  • Tuple: what we mainly concern ourselves with in machine learning

System of linear equations

Let X = (x_1, x_2, …, x_n) and A = (α_1, α_2, …, α_m), with A, X ∈ ℝ; then

f(x) = Σ_{j=1}^{m} Σ_{i=1}^{n} α_{ij} x_i
  • Let there be a sequence of vectors X and scalars A; there exists a function defining the sum of products between x and α

Example, please?

4x₁ + 4x₂ = 5
2x₁ − 4x₂ = 1   (+)
⇒ 6x₁ = 6

x₁ = 1
x₂ = 0.25

What do we learn?

  • A system of linear equations is not complicated
  • It is intuitive to solve
  • At a glance, we can see a pattern in their solutions
  • But a computer does not rely on intuition
  • How do we teach a computer to solve them?


Solution: use matrices


Re-formulate our solution as a matrix

[4 4; 2 −4][x₁; x₂] = [5; 1]

Let A := [4 4; 2 −4]

A[x₁; x₂] = [5; 1]
A⁻¹A[x₁; x₂] = A⁻¹[5; 1]

Calculate an inverse matrix A1

Recall that AA⁻¹ = I. Start by placing the original matrix and the identity matrix side by side:

[4 4 | 1 0; 2 −4 | 0 1]

Calculate an inverse matrix A1

Add the second row to the first:

[6 0 | 1 1; 2 −4 | 0 1]

Calculate an inverse matrix A1

Divide the first row by 6:

[1 0 | 1/6 1/6; 2 −4 | 0 1]

Calculate an inverse matrix A1

Divide the second row by 2:

[1 0 | 1/6 1/6; 1 −2 | 0 1/2]

Calculate an inverse matrix A1

Subtract the first row from the second:

[1 0 | 1/6 1/6; 0 −2 | −1/6 1/3]

Calculate an inverse matrix A1

Divide the second row by −2:

[1 0 | 1/6 1/6; 0 1 | 1/12 −1/6]

Now you have I on the left and A⁻¹ on the right :)

Calculate an inverse matrix A1

  • We call this method elementary row operations
  • Basically, you can multiply any row by any scalar
  • But you can only add or subtract one row with another
  • Swapping rows is perfectly fine though

This is one of the simplest methods to obtain A⁻¹ for m×m matrices

Solving the equation

A⁻¹A[x₁; x₂] = A⁻¹[5; 1]
[x₁; x₂] = [1/6 1/6; 1/12 −1/6][5; 1]
[x₁; x₂] = [1; 1/4]
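The row operations above can be sketched as a small Gauss-Jordan routine (exact arithmetic via fractions; the matrix and right-hand side are the ones from the slides):

```python
from fractions import Fraction as F

# Augmented matrix [A | I] for A = [4 4; 2 -4]
aug = [[F(4), F(4), F(1), F(0)],
       [F(2), F(-4), F(0), F(1)]]
n = 2

for col in range(n):
    # Scale the pivot row so the pivot becomes 1
    piv = aug[col][col]
    aug[col] = [v / piv for v in aug[col]]
    # Subtract multiples of the pivot row from the other rows
    for row in range(n):
        if row != col:
            factor = aug[row][col]
            aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]

inv = [row[n:] for row in aug]          # right half is now A^-1
b = [F(5), F(1)]
x = [sum(inv[i][j] * b[j] for j in range(n)) for i in range(n)]
```

The exact sequence of pivots differs from the manual steps on the slides, but the result is the same A⁻¹ and the same solution x₁ = 1, x₂ = 1/4.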

Why use matrices?

  • Computers understand matrices as a mathematical construct
  • They can easily perform mathematical operations
  • A matrix generalizes to higher-dimensional objects, i.e. tensors
  • Dot product → tensor product


Hardware-wise

  • CPU: General arithmetic operation
  • GPU: Matrix manipulation
  • TPU: Tensor manipulation

Are all matrices invertible?

  • No. Sorry to disappoint :(
  • A non-invertible matrix is called singular; an invertible one is regular / non-singular
  • Only invertible matrices provide unique solutions

Metric space

  • A space where we consider the distance between two data points
  • Comes as a pair of objects: a set and a distance function on it
  • Some specific properties?

d : X × X → [0, ∞)
  • X is a set of real numbers

Metric space

Let (x, y) ∈ X; then

d(x, y) = 0 ⇔ x = y
d(x, y) > 0 for x ≠ y
d(x, y) = d(y, x)
d(x, y) ≤ d(x, z) + d(z, y)

  • Let the pair (x, y) be elements of the set X
  • The distance function d may vary

Distance function

  • Different functions exist!
  • We need a measure to define how dissimilar our data points are
  • Euclidean
  • Manhattan
  • Minkowski


Euclidean distance

D = √( Σ_{i=1}^{n} (p_i − q_i)² )

  • Measures the shortest distance in a Euclidean space
  • The most commonly used distance measure

n is the number of dimensions


Manhattan distance

D = Σ_{i=1}^{n} |p_i − q_i|

  • An absolute distance in a Euclidean space
  • Often preferred for higher-dimensional data


Minkowski distance

D = ( Σ_{i=1}^{n} |p_i − q_i|^p )^{1/p}

  • Useful in non-Euclidean spaces
  • Not a part of the metric space, but still nice to know
  • p: Order of the norm
  • A more general form of the former two distance measures
  • When p = 1, it behaves as the Manhattan distance
  • When p = 2, it behaves as the Euclidean distance
  • Fun fact: Einstein's special relativity used Minkowski distance
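A minimal sketch of the three measures, confirming that Minkowski reduces to Manhattan at p = 1 and to Euclidean at p = 2 (the points a and b are hypothetical):

```python
import math

def euclidean(u, v):
    """Shortest straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate-wise differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def minkowski(u, v, p):
    """Order-p generalization of the two measures above."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

a, b = (1.0, 2.0), (4.0, 6.0)   # a 3-4-5 right triangle apart
```

Here `euclidean(a, b)` is 5 and `manhattan(a, b)` is 7, matching `minkowski(a, b, 2)` and `minkowski(a, b, 1)` respectively.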

Comparison of distance measures

Normed linear space

  • A non-negative map on the vector space
  • A formalization of the length of x
  • The norm used is a real-valued function

Example: Hilbert space, Lp space


Inner product space

  • Basically a vector space equipped with an inner product
  • Maps two vectors to a scalar
  • Inner product: a generalization of the dot product


Imagine the inner product as a dot product without the dimensionality constraint

Overview

  • Linear algebra
  • Calculus
  • Probability and statistics

Summary:

  • Linear algebra: a set of mathematical operations over specific spaces
  • Vector spaces: we deal a lot with matrices and tensors
  • Application: a direct implementation is gradient descent
  • Algebra deals with relations

Calculus

  • Optimizing machine learning algorithms
  • Minimizing the cost function
  • Finding extrema: minima or maxima?
  • Determining the weights in our model
  • Cost function: akin to ε when performing regression
  • Lower ε → lower randomness

Gradient

A generalization of partial derivatives, used to find an extremum

∇f = [ ∂f/∂x_1, …, ∂f/∂x_n ]
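Numerically, the gradient can be sketched with central differences, one partial derivative per coordinate (f below is a hypothetical cost function with known analytic gradient [2x₁, 2x₂]):

```python
def grad(f, x, h=1e-6):
    """Approximate the gradient of f at x by central differences."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h   # nudge coordinate i up
        xm[i] -= h   # nudge coordinate i down
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

f = lambda v: v[0] ** 2 + v[1] ** 2   # a simple convex cost function
g = grad(f, [3.0, 4.0])               # analytic gradient here is [6, 8]
```

A gradient-descent step would then move in the direction −g to reduce the cost.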


Overview

  • Linear algebra
  • Calculus
  • Probability and statistics

Probability and statistics

P(A|B) = P(A ∩ B) / P(B)

  • Both frequentist and Bayesian statistics revolve around conditional probability
  • Only Bayesian statistics allows updating prior belief on probability function
  • But, what is statistics in essence?
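Conditional probability can be sketched with a single fair die (the events below are chosen purely for illustration):

```python
from fractions import Fraction

S = set(range(1, 7))   # sample space: one fair die roll
A = {2, 4, 6}          # event A: the outcome is even
B = {4, 5, 6}          # event B: the outcome is greater than 3

def p(event):
    """Probability as |event| / |sample space|."""
    return Fraction(len(event), len(S))

p_a_given_b = p(A & B) / p(B)   # P(A|B) = P(A ∩ B) / P(B)
```

Knowing the roll exceeded 3 raises the chance of an even outcome from 1/2 to 2/3.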

Population

All observable subjects inhabiting a certain location


Parameters

Quantitative summary of a population


Be more specific, please?
Notation and meanings

X: Data element
N: Number of elements
p: Proportion
M: Median
μ: Average
σ: Standard deviation
σ²: Variance
ρ: Correlation coefficient

Sample

A subset of an observable population


Statistics

Quantitative summary of a sample


Be more specific, please?
Notation and meanings

x: Data element
n: Number of elements
p̂: Proportion
m: Median
x̄: Average
s: Standard deviation
s²: Variance
r: Correlation coefficient

Probability

  • An event E occurring within a particular sample space S
  • Event: Expected results
  • Sample space: All possible outcomes
  • Probability P is the proportion of an event relative to its sample space
  • Or mathematically:

P(E = e) = |E| / |S|
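Counting an event against its sample space, sketched for two dice (a hypothetical example):

```python
from fractions import Fraction
from itertools import product

# Sample space S: all ordered outcomes of rolling two fair dice
S = list(product(range(1, 7), repeat=2))

# Event E: the two faces sum to 7
E = [outcome for outcome in S if sum(outcome) == 7]

prob = Fraction(len(E), len(S))   # |E| / |S| = 6/36
```

Six of the 36 equally likely outcomes sum to 7, so the probability is 1/6.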


Random Variables

Independent vs Identical? I.I.D

  • All sampled random variables should be independent from one another
  • Each sampling procedure has to be identical, so as to produce the same probability

Considering I.I.D, can we do a better probability estimation?

  • If they are I.I.D, we can approximate the probability using:
    • Probability Mass Function (discrete variables)
    • Probability Density Function (continuous variables)

In math, please?

(1) P(E = e) = f(e) ≥ 0 : e ∈ S
(2) Σ_{e ∈ S} f(e) = 1
(3) P(E ∈ A) = Σ_{e ∈ A} f(e) : A ⊆ S
  • The function f is arbitrary; it can take on any form
  • There are myriad distributions
  • We will look at specific examples

Probability Mass Function

  • Binomial: X ~ B(n, p)
  • Geometric: X ~ G(p)
  • Poisson: X ~ P(λ)
Similarity: all describe discrete random variables following Bernoulli trials

Bernoulli trials:

  • All instances are independent
  • All instances have identical probability
  • Example: A coin flip, sampling with replacement

Binomial distribution

Key question: How many events occur in n trials, each with probability p?

Geometric distribution

Key question: How many failures before getting an event?


Poisson distribution

Key question: What is the chance of having n events given a λ rate?

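Minimal sketches of the three mass functions (standard textbook formulas; the parameter values used below are arbitrary):

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p): k successes in n Bernoulli trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """P(X = k): k failures before the first success."""
    return (1 - p) ** k * p

def poisson_pmf(k, lam):
    """P(X = k) for X ~ P(lam): k events given rate lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)
```

Summing `binomial_pmf(k, n, p)` over k = 0..n gives 1, matching axiom (2) of a PMF.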

Probability Density Function

  • Gamma: X ~ Γ(α, β)
  • Exponential: X ~ Exp(λ)
  • χ²: X ~ χ²(ν)
  • Normal: X ~ N(μ, σ)
Similarity: all describe continuous random variables


Gamma distribution

Exponential distribution

χ² distribution

Normal distribution

Central Limit Theorem

X̄ →_d N(μ, σ/√n) as n → ∞

  • So far, we have learnt sampling distributions
  • We are also able to compute the mean and standard deviation based on their parameters
  • It just so happens that the sample mean follows a normal distribution
  • The central limit theorem delineates this occurrence
  • This rule applies to both discrete and continuous distributions

→_d denotes convergence in distribution of random variables
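The theorem can be sketched empirically: means of repeated die-roll samples cluster around μ = 3.5 with spread σ/√n, even though a single roll is uniform rather than normal (the sample sizes are arbitrary and the RNG is seeded for reproducibility):

```python
import math
import random
import statistics

random.seed(42)

n, reps = 30, 2000   # n rolls per sample, reps samples
sample_means = [statistics.mean(random.randint(1, 6) for _ in range(n))
                for _ in range(reps)]

mu = 3.5                                # population mean of one die roll
sigma = statistics.pstdev(range(1, 7))  # population sd, sqrt(35/12) ≈ 1.71
se = sigma / math.sqrt(n)               # CLT-predicted spread of the sample mean
```

The empirical mean and standard deviation of `sample_means` land close to μ and σ/√n, which is exactly what the theorem predicts.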

Linear Model

y = β₀ + Σ_{i=1}^{n} βᵢxᵢ + ε

  • Estimating y as f(x)
  • Assumptions:
    • Linearity between y and xi at any index i
    • Homoskedasticity
    • Non-existent multicollinearity
    • Normality of residual ϵ
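For a single predictor, the least-squares coefficients have a closed form, sketched below on hypothetical data roughly following y = 1 + 2x:

```python
# Hypothetical data, roughly y = 1 + 2x plus noise
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
n = len(xs)

mx = sum(xs) / n
my = sum(ys) / n

# slope = cov(x, y) / var(x); intercept follows from the means
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
beta1 = sxy / sxx
beta0 = my - beta1 * mx

# Residuals ε: what the line fails to explain; OLS makes them sum to zero
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
```

The fitted slope and intercept (≈ 1.94 and ≈ 1.15 here) recover the underlying trend, and the residuals sum to zero by construction.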

Generalized Linear Model

g(y) = β₀ + Σ_{i=1}^{n} βᵢxᵢ + ε

  • Estimating the output of g(y) as f(x)
  • Assumptions:
    • Linearity between g(y) and f(x)
    • Non-existent multicollinearity
    • Homoskedasticity (more lenient compared to LM)

What is g(y)?

  • A link function
  • Basically, it transforms y into a continuous linear value
  • A GLM with the identity link function = LM
  • Depends on which distribution y comes from
  • Binomial distribution: logit and probit (also termed a logistic regression)
  • Poisson distribution: log (log-linear regression), identity, sqrt
  • Other distributions exist: inverse Gaussian, Cauchy, gamma, etc.

Regression in statistics

  • Depends on which distribution y is coming from
  • Ordinary Least Squares determines the value of each parameter
  • Maximum Likelihood Estimation determines the slope and intercept β₀
  • The variance-covariance matrix determines the standard error, used to calculate the p-value

Overview

  • Linear algebra
  • Calculus
  • Probability and statistics