Mathematics for Machine Learning
Mathematics is at the heart of machine learning algorithms. It lets us communicate our problem and its solution as formal logic, in a way a computer can understand. The scope of mathematics in machine learning is broader than what this lecture can capture, so we shall focus on a few branches: linear algebra, calculus, and probability and statistics. This post does not intend to provide an in-depth review of mathematical concepts. Instead, it serves as a gentle introduction to get you excited about learning more mathematics.
Linear Algebra
As a branch of mathematics, linear algebra sets the rules for manipulating mathematical objects within a defined set. Its objects include scalars, vectors, matrices and tensors. Like other branches of mathematics, linear algebra governs its own spaces, which include the vector, metric, normed linear and inner product spaces. By defining rules for operating on these objects, linear algebra specifies relationships between multiple variables. Variables are observable features of interest, which we will refer to as dimensions from this point onward. We may represent a linear relationship with the following simple equation:
$$\alpha x + \beta y = c$$
Please be advised that we often assume a flat Euclidean space when constructing algebraic rules. However, any of Euclid's postulates may be violated in a non-Euclidean geometry, and in such cases a Euclidean approximation does not generalize well. As a quick reminder, Euclid proposed five postulates defining a straight line:
- A straight line segment can be drawn joining any two points.
- Any straight line segment can be extended indefinitely in a straight line.
- Given any straight line segment, a circle can be drawn having the segment as its radius and one endpoint as its center.
- All Right Angles are congruent.
- If two lines are drawn which intersect a third in such a way that the sum of the inner angles on one side is less than two Right Angles, then the two lines inevitably must intersect each other on that side if extended far enough. This postulate is equivalent to what is known as the Parallel Postulate.
The first four postulates seem easy to digest and do not violate our preconceptions about straight lines. However, the fifth one is a bit of a mouthful, and we need to read it carefully so as not to misinterpret the idea. We can paraphrase the fifth postulate as describing the behaviour of two parallel lines.
Now, all of them sound reasonable, but do they hold everywhere? Unfortunately, Euclid's postulates only prevail on a flat surface. With added curvature, i.e. in a non-Euclidean geometry, at least one of them will be violated.
Still, Euclidean space is a useful model of our real world, where we can approximate distances between two data points. Most well-established machine learning algorithms compute their distance metrics in Euclidean space; hierarchical clustering and principal component analysis are but two examples, and many others follow suit.
Due to the constraints of Euclidean space, data with higher dimensionality suffer from phenomena unseen in lower dimensions, collectively known as the curse of dimensionality. Generally speaking, more dimensions require a larger volume of data. Some address this limitation by developing non-Euclidean algorithms, e.g. geometric deep learning. To keep the post brief, we will only consider Euclidean space when creating our model. We have seen how algebra relates to geometry, and we know its limitations. Now, we can safely proceed to comprehend the role of algebra in machine learning :)
Vector Space
In a vector space, algebraic rules govern addition and multiplication, resulting in objects of the same class. During high school, we may have seen a few examples of geometric vectors, for instance those we recognize from physics. We know a vector has a magnitude and a direction, and that opposing vectors of the same magnitude cancel each other out. In mathematics, the classes are more general: we can have polynomials, signals and tuples. In essence, a vector space treats such sequences as objects to manipulate. To bring it into context, machine learning uses tuples as its objects in the vector space. In a system of linear equations, we define:
\begin{align} \text{Let } X &= (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) \\ A &= (\alpha_1, \alpha_2, \ldots, \alpha_m) : \forall A, X \in \mathbb{R}, \text{ then} \\ \exists f(x) &= \displaystyle \sum_{j=1}^{m} \sum_{i=1}^{n} \alpha_{ij} \cdot \mathbf{x}_i = \Pi \end{align}
It looks horrible! But rest easy, we can read it this way:
Let there be a sequence of vectors \(X\) and scalars \(A\); then there exists a function defining the sum of products between \(\mathbf{x}\) and \(\alpha\).
Now, it does not seem as bad as it first looked :) We can illustrate the concept with a quick example:
\begin{align} 4x_1 + 4x_2 &= 5 \\ 2x_1 - 4x_2 &= 1 \tag{+} \\ \hline 6x_1 &= 6 \\ \end{align}
\begin{align} \therefore x_1 & = 1 \\ x_2 & = 0.25 \end{align}
I said quick, but I did not mean it to be this short! Well, that is partially because the fundamentals of linear algebra are not complicated. The system is intuitive to solve; even at a glance, we can see a pattern emerging from our set of equations. However, a computer does not rely on intuition. Instead, we teach a computer to solve algebraic problems using matrices. If we re-formulate our problem and solution in matrix form, we can present them in the following fashion:
\begin{align} \begin{bmatrix} 4 & 4 \\ 2 & -4 \end{bmatrix} \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} &= \begin{bmatrix} 5 \\ 1 \end{bmatrix} \\ Let\ A &:= \begin{bmatrix} 4 & 4 \\ 2 & -4 \end{bmatrix} \\ A \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} &= \begin{bmatrix} 5 \\ 1 \end{bmatrix} \\ A^{-1} \cdot A \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} &= A^{-1} \begin{bmatrix} 5 \\ 1 \end{bmatrix} \\ \end{align}
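As a sanity check, here is a minimal Python sketch (using NumPy, which is my own choice since the post itself ships no code) that solves the same system directly:

```python
import numpy as np

# Coefficient matrix and right-hand side from the example above
A = np.array([[4.0, 4.0],
              [2.0, -4.0]])
b = np.array([5.0, 1.0])

# Solve A @ x = b without forming the inverse explicitly
x = np.linalg.solve(A, b)
print(x)  # expected: [1.   0.25]
```

In practice, `np.linalg.solve` is preferred over computing the inverse explicitly, but we will walk through the inverse below because it makes the mechanics transparent.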
What we need to accomplish now is finding the appropriate inverse matrix \(A^{-1}\). Recall that \(A \cdot A^{-1} = I\), with \(I\) being an identity matrix.
We will start solving for \(A^{-1}\) by placing the original matrix and the identity matrix side by side. At the end of our procedure, we expect to have turned the original matrix into an identity matrix. Since we will perform row-wise operations, any change will occur on both sides. That means, by turning the original into an identity matrix, we will simultaneously turn the identity matrix \(I\) (right side) into the inverse matrix \(A^{-1}\).
\begin{align} \left[\begin{array}{cc|cc} 4 & 4 & 1 & 0 \\ 2 & -4 & 0 & 1 \end{array}\right] \end{align}
Add the second row into the first
\begin{align} \left[\begin{array}{cc|cc} 6 & 0 & 1 & 1 \\ 2 & -4 & 0 & 1 \end{array}\right] \end{align}
Divide the first row by 6
\begin{align} \left[\begin{array}{cc|cc} 1 & 0 & \frac{1}{6} & \frac{1}{6} \\ 2 & -4 & 0 & 1 \end{array}\right] \end{align}
Divide the second row by 2
\begin{align} \left[\begin{array}{cc|cc} 1 & 0 & \frac{1}{6} & \frac{1}{6} \\ 1 & -2 & 0 & \frac{1}{2} \end{array}\right] \end{align}
Subtract the first row from the second
\begin{align} \left[\begin{array}{cc|cc} 1 & 0 & \frac{1}{6} & \frac{1}{6} \\ 0 & -2 & - \frac{1}{6} & \frac{1}{3} \end{array}\right] \end{align}
Divide the second row by -2
\begin{align} \left[\begin{array}{cc|cc} 1 & 0 & \frac{1}{6} & \frac{1}{6} \\ 0 & 1 & \frac{1}{12} & - \frac{1}{6} \end{array}\right] \end{align}
Now you have \(I\) on the left and \(A^{-1}\) on the right :) We call this method elementary row operations (also known as Gauss-Jordan elimination). With this method, you can multiply a particular row by any non-zero scalar. However, you may not add or subtract a constant; addition and subtraction are only permissible between rows, e.g. subtracting the first row from the second. For the record, swapping rows is perfectly fine as well. The computation may seem daunting, but it is one of the simplest and most explicit methods for inverting an \(n \times n\) matrix. Other methods exist, and I will leave those for you to explore. Since we have our inverse matrix \(A^{-1}\), we can solve our problem:
\begin{align} A^{-1} \cdot A \cdot \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} &= A^{-1} \begin{bmatrix} 5 \\ 1 \end{bmatrix} \\ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} &= \begin{bmatrix}\frac{1}{6} & \frac{1}{6} \\ \frac{1}{12} & - \frac{1}{6}\end{bmatrix} \cdot \begin{bmatrix} 5 \\ 1 \end{bmatrix} \\ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} &= \begin{bmatrix} 1 \\ \frac{1}{4} \end{bmatrix} \end{align}
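If you would like to double-check the hand computation, a short NumPy sketch (again my own addition, not part of the original post) reproduces the inverse and the solution:

```python
import numpy as np

A = np.array([[4.0, 4.0],
              [2.0, -4.0]])
b = np.array([5.0, 1.0])

A_inv = np.linalg.inv(A)
print(A_inv)      # approximately [[1/6, 1/6], [1/12, -1/6]]
print(A_inv @ b)  # expected: [1.   0.25]
```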
It took longer than expected, so why do we need matrices at all? Well, to be fair, computers understand matrices as a mathematical construct. They can easily perform such operations, far faster than most of us can keep up with. Also, a matrix generalizes to a higher-dimensional object, i.e. a tensor. The dot product of matrices is akin to the inner product (or tensor product, if you will) of tensors. But I have to disappoint you with the fact that not all matrices are invertible :( Only an invertible matrix provides a unique solution to our system of linear equations.
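To illustrate that last point, consider the sketch below: a matrix whose rows are linearly dependent has a zero determinant, and NumPy refuses to invert it (the specific matrix is just a made-up example of mine).

```python
import numpy as np

# The second row is twice the first, so the matrix is singular
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])

print(np.linalg.det(S))  # expected: 0.0 (up to floating-point error)

try:
    np.linalg.inv(S)
except np.linalg.LinAlgError as err:
    print("Not invertible:", err)  # "Singular matrix"
```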
Metric Space
Still related to the vector space, a metric space is the space in which distances between two data points live. A metric space pairs a set with a distance function, and the pair must exhibit particular properties. We denote the distance function of a metric space as:
\begin{align} d: X \times X \to [0, \infty) \end{align}
with \(X\) being a set of real numbers. Some properties of the distance \(d(x, y)\) we can observe in a metric space are:
\begin{align} \text{Let } x, y, z & \in X \\ d(x, y) &= 0 \iff x = y \\ d(x, y) & > 0 \iff x \neq y \\ d(x, y) &= d(y, x) \\ d(x, y) & \leqslant d(x, z) + d(z, y) \end{align}
In determining the distance \(d(x,y)\)
, we can use different approaches based on
what we need to accomplish.
Euclidean Distance
$$D = \sqrt{\displaystyle \sum_{i=1}^n (p_i - q_i)^2}$$
- Measures the shortest (straight-line) distance in a Euclidean space
- The most commonly used distance measure
- \(n\) is the number of dimensions
Manhattan Distance
$$D = \displaystyle \sum_{i=1}^{n} |p_i - q_i|$$
- The absolute (city-block) distance in a Euclidean space
- Works well for higher-dimensional data
Minkowski Distance
$$D = \bigg( \displaystyle \sum_{i=1}^n |p_i - q_i|^p \bigg)^{\frac{1}{p}}$$
- Useful in a non-Euclidean space
- Not always a proper metric (for \(p < 1\)), but still nice to know
- \(p\): the order of the norm
- A more general form of the former two distance measures
- When \(p = 1\), it behaves as the Manhattan distance
- When \(p = 2\), it behaves as the Euclidean distance
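To make the three measures concrete, here is a minimal NumPy sketch (the helper name `minkowski_distance` and the example points are mine, not from the post) that computes them for two vectors:

```python
import numpy as np

def minkowski_distance(p_vec, q_vec, p=2):
    """General Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(p_vec - q_vec) ** p) ** (1.0 / p)

p_vec = np.array([1.0, 2.0, 3.0])
q_vec = np.array([4.0, 6.0, 3.0])

print(minkowski_distance(p_vec, q_vec, p=1))  # Manhattan distance: 7.0
print(minkowski_distance(p_vec, q_vec, p=2))  # Euclidean distance: 5.0
```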
Comparison of Distance Measures
Normed Linear Space
A normed linear space is a vector space equipped with a norm, a non-negative real-valued function that formalizes the length of \(x\). The norm approximates the real-world length of \(x\) as measured from the origin \(0\). Examples of normed linear spaces include the Hilbert space and the \(L^p\) space.
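As a small illustration (relying on NumPy's `np.linalg.norm`, which is my choice here rather than anything prescribed by the post), the \(L^1\) and \(L^2\) norms of the same vector can differ considerably:

```python
import numpy as np

x = np.array([3.0, -4.0])

print(np.linalg.norm(x, ord=2))  # L2 (Euclidean) norm: 5.0
print(np.linalg.norm(x, ord=1))  # L1 norm: 7.0
```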
Inner Product Space
In essence, the inner product space introduces inner products into a vector space (duh!). An inner product maps two vectors to a scalar, and we define it as a generalization of the dot product. Please recall our discussion on tensor products: we can imagine an inner product as a dot product without the dimensionality constraint.
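A quick sketch of that idea (again leaning on NumPy by my own assumption) shows the dot product as the familiar inner product on \(\mathbb{R}^n\):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Inner (dot) product of two vectors: sum of element-wise products
print(np.inner(u, v))  # 1*4 + 2*5 + 3*6 = 32.0
```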
Calculus
The role of calculus in machine learning is to optimize the modelled algorithm. Calculus can minimize the cost function by adjusting the weights used in the model. It aims to find an extremum, either a minimum or a maximum. Which extremum we look for depends on the cost function assigned to assess the model. In general, there are two types of cost function, i.e. convex and concave functions. We look for a minimum in a convex function and a maximum in a concave function. In doing so, we compute the gradient, i.e. a vector of partial derivatives, to find the optimal values:
\begin{align} \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \end{align}
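To show how the gradient drives optimization, here is a minimal gradient-descent sketch on the simple convex function \(f(x_1, x_2) = x_1^2 + x_2^2\) (the function, learning rate, and iteration count are all illustrative choices of mine, not taken from the post):

```python
import numpy as np

def grad_f(x):
    """Gradient of f(x1, x2) = x1^2 + x2^2, i.e. [2*x1, 2*x2]."""
    return 2 * x

x = np.array([3.0, -4.0])   # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    x = x - learning_rate * grad_f(x)  # step against the gradient

print(x)  # approaches the minimum at [0, 0]
```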
Probability and Statistics
$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
Both frequentist and Bayesian statistics revolve around conditional probabilities. The key difference is that Bayesian statistics allows us to update our prior belief about the probability distribution. But it is pivotal to understand the question: what is a statistic, in essence? A statistic is an approximation of a population parameter, which we discussed extensively in the biostatistics course.
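As a worked example of a Bayesian update (all the numbers below are made up purely for illustration), suppose a prior \(P(A) = 0.01\), a likelihood \(P(B \mid A) = 0.9\), and \(P(B \mid \neg A) = 0.05\); Bayes' theorem then gives the posterior:

```python
# Made-up numbers purely for illustration
prior = 0.01            # P(A)
likelihood = 0.9        # P(B | A)
false_positive = 0.05   # P(B | not A)

# Total probability of the evidence: P(B)
evidence = likelihood * prior + false_positive * (1 - prior)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
posterior = likelihood * prior / evidence
print(round(posterior, 3))  # approximately 0.154
```

The posterior is notably larger than the prior, which is exactly the "updating our prior belief" that distinguishes the Bayesian view.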