TubeSum ← Transcribe a video

Video uZeDTwWcnuY

0h 41m video Transcribed May 27, 2026 Watch on YouTube ↗
Intermediate 20 min read For: Programmers and data scientists with basic math background who want to understand the role of linear algebra in machine learning.

AI Summary

This video explains why linear algebra is foundational to machine learning, framing it as the mathematics of arrays and optimization. It introduces key concepts like matrix multiplication, linear functions, and the singular value decomposition (SVD), emphasizing a programming-centric perspective.

[00:13]
Machine learning is programming by optimization

ML uses optimization to improve programs, requiring math to understand what is being optimized and how.

[01:00]
Three math pillars for ML

Linear algebra (objects being optimized), calculus (how to optimize), and probability/statistics (what to optimize).

[02:14]
Linear algebra defined as mathematics of arrays

Arrays represent data, models, and computations in ML; matrix multiplication is the core operation.

[03:28]
Matrix multiplication powers many algorithms

Examples include dot products, PCA, neural network layers, convolutions, and even GPU acceleration.

[05:50]
Linear algebra is not like traditional algebra

It's better understood geometrically (3Blue1Brown) or via programming (matrices as functions, multiplication as composition).

[08:00]
Programming analogy for linear algebra

Matrices are functions, shapes are types, and matrix multiplication is function composition.

[14:58]
Why arrays are good for optimization

Small changes to matrix entries yield small output changes, enabling gradient-based optimization.

[18:08]
Definition of linear functions

Two rules: f(a+b)=f(a)+f(b) and f(λa)=λf(a); they respect weighted sums.

[22:51]
SVD as matrix refactoring

SVD decomposes a matrix into three pieces (surjection, bijection, injection), analogous to refactoring code.

[34:05]
Low-rank approximation via SVD

Used for compression (e.g., JPEG) and foreground-background separation in video.

Linear algebra is crucial for ML because it provides a smooth, optimizable representation of functions via arrays. The SVD is a key tool for understanding and simplifying matrices, much like refactoring in programming.

Clickbait Check

95% Legit

"Title accurately reflects content: the video thoroughly explains why linear algebra matters for ML."

Mentioned in this Video

Study Flashcards (8)

What is the core operation in linear algebra?

easy Click to reveal answer

Matrix multiplication.

02:42

What are the two rules for a function to be linear?

medium Click to reveal answer

f(a+b)=f(a)+f(b) and f(λa)=λf(a).

18:08

How does the programming analogy view matrices?

medium Click to reveal answer

Matrices are functions, shapes are types, and matrix multiplication is function composition.

08:00

What is the kernel of a linear function?

hard Click to reveal answer

The set of inputs that are sent to zero.

20:47

What does the rank of a matrix measure?

hard Click to reveal answer

How many linearly independent outputs (non-kernel elements) are needed to describe all outputs.

21:52

What is the SVD analogous to in programming?

hard Click to reveal answer

Refactoring a program into three pieces: surjection, bijection, injection.

22:51

Why are arrays good for optimization?

medium Click to reveal answer

Small changes to entries yield small output changes, enabling gradient-based methods.

17:29

What is a low-rank approximation?

medium Click to reveal answer

Approximating a matrix by one with lower rank, often used for compression or denoising.

34:05

🔥 Best Moments

💡

Linear algebra is not like algebra

Challenges traditional teaching by emphasizing geometric and programming perspectives over equation manipulation.

05:50
💡

Matrices as functions, multiplication as composition

Provides a powerful, intuitive analogy that bridges linear algebra and programming.

08:00
🤯

SVD as matrix refactoring

Elegantly connects a core linear algebra concept to a familiar software engineering practice.

22:51

Full Transcript

Download .txt

[00:00] So, at a very high level, what does math have to do with machine learning? Like, why math for

[00:13] machine learning among all the things that we could be talking about? And all programming involves mathematics at some level. Mathematics and programming and computers have been tied together since the inception of computer science and programming. Machine learning in particular,

[00:28] though, is programming by optimization. The way that we program computers to do things in machine learning is through optimization. And in order to understand optimization, in order to understand

[00:42] what it is that we're optimizing and why and how that works, we need mathematics. And this is what makes machine learning such a more mathematical discipline of programming than something like web development or database management or any of the other different branches of programming

[01:00] that are popular these days. And so in this series, we're going to cover several of the ways that optimization and machine learning intersect or interact. Linear algebra, which will help us understand the objects being optimized.

[01:16] Calculus, which will help us understand how it is that we optimize those things. And then probability and statistics, which will help us understand what that thing is that we are optimizing, that we are making better.

[01:29] So we'll start with that section on linear algebra, which is to understand the objects that are being optimized. So the takeaways for today are that linear algebra is important and why it's so important.

[01:42] That linear algebra, despite its name, is actually not that much like algebra. and that the FDD or singular value decomposition is a matrix version of refactoring a common technique in programming.

[01:57] First, linear algebra is important. Why do we care about linear algebra? First-pass definition is that linear algebra is the mathematics of arrays. An array is a collection of numbers, a sort of block-like collection of numbers.

[02:14] Those numbers are typically, usually we think of them as being floating point numbers in a computer or real numbers in mathematics. And lots of objects are represented by arrays in machine learning.

[02:26] Our data is going to be in an array. There will be our inputs and our outputs. If our inputs are images, there are going to be arrays of numbers that represent that image. Very often, our models, which are like the equivalents of our programs in machine learning, will be represented by arrays or by collections of arrays.

[02:42] And then the internal computations even of those models will be represented by arrays as well. And so the core operation in linear algebra, the core operation that we use to manipulate those arrays is matrix multiplication.

[02:57] So a couple of examples of matrix multiplication in action. You may have heard of the dot, scalar, or inner product, the correlation or the covariance in statistics, linear and logistic regression, ideas from classical machine learning, principal components analysis,

[03:11] Also a bunch of useful algorithms outside of what you would think of as machine learning. Discrete Fourier transform, page rank, then things in machine learning again, hidden layers of neural networks, convolutions, the second order methods for optimization, Newton and LBFDS.

[03:28] Nature's full location is essentially the operation at the core of every single one of these examples. And in the end, GPUs, you may have heard of the importance of GPUs, graphics processing units.

[03:41] GDUs are a huge part of what drove this huge explosion in machine learning recently. And they were originally designed for video games, and they made video games faster and better. But they ended up also making machine learning faster and better because underneath it all,

[03:57] both graphics rendering and machine learning require linear algebra. And that matrix multiplication is surprisingly simple, despite how many things it can be used for. So we have our arrays of numbers. So we have one array here on the left, another on the right, that's on the left side of the equal sign here,

[04:14] and that's going to result in the output of another array. So if I take two matrices and multiply them, I take the rows of the first matrix and the columns of the second matrix, and I combine those together to get the elements of the output.

[04:28] What I do is I take each number in this row and each number in this column, and I multiply them together. That gives me a whole bunch of numbers, and then I add them up, and that gives me this final number here. I reduce them down with this sum operation.

[04:42] So mathematically, that might be written that if I were to matrix multiply x and y together here and look at the i's row and the j's column, then what I'm going to do is take the i's row of x, the j's column of y,

[04:54] and multiply their entries and sum it up. And if I want to get all the entries of that output of that matrix multiplication of x and y, I just do that for every row and for every column. The matrix of whole partition takes two matrices where one has the same number of rows as the other has columns

[05:09] and turns them into a single output with the same number of rows as the first one and the same number of columns as the second one. It's relatively simple given how many things it's capable of doing, but it's also kind of a weird and specific rule that if I just told you,

[05:24] oh, here's how we're going to combine collections of numbers, there's no obvious reason why you would do it that way. And from here, I think a lot of linear algebra tutorials would just jump straight into algebra. They would give algebraic definitions in terms of equations of a whole bunch of key concepts

[05:38] and then start manipulating those equations, manipulating matrices. And I think this is not the right path to take here, because I think it doesn't give you a motivation for why we're doing that rule and how you should really think about linear algebra

[05:50] rather than thinking that it's manipulating a bunch of equations. Let's dive into this first of our major concepts that linear algebra is not like algebra. Linear algebra is really often taught like algebra, first discovered in this world where

[06:04] people were really concerned with solving equations. And linear algebra kind of looks like algebra because it's got addition and it's got multiplication and there's rules for combining them. But it looks a lot like algebra. And lots of linear algebra classes spend their time on just those computations and manipulations,

[06:21] determinants, eigenvalues, things like that. And folks end up getting pretty confused because the rules are complicated and reasoning with equations is pretty hard and unintuitive. It took me a very long time before I became comfortable just doing that kind of algebraic manipulation with linear algebra.

[06:36] But that's not the only way to understand linear algebra. There's lots of other ways to look at it that are more fruitful. And so one of them is a sort of geometric view of linear algebra that I think is best put forward by this YouTube series by the YouTuber 3Blue1Brown,

[06:51] formerly of Kahn Academy, The Essence of Linear Algebra. And that views linear algebra as the study of linear transformations of spaces, of transformations, of rotations, reflections, and scalings of a grid.

[07:05] And so that's one interesting way of looking at it, and I recommend you check out that YouTube series. It's very beloved. There's also other mathematical ways of understanding linear algebra that don't have to do with equational reasoning.

[07:17] For example, graphical linear algebra approach, which came out of the world of category theory in the last 10 years. This nice blog post series by Pavel Sovachinsky that basically views linear algebra as what happens when adding meets copying,

[07:32] which is an interesting alternative way of looking at linear algebra and uses diagrammatic reasoning like you see on the bottom right of the slide rather than equational reasoning. It's completely rigorous. It's in some ways even more rigorous than the normal way that you do linear algebra,

[07:46] but it involves this very spatial picture-based reasoning. So if you're interested in abstract math, that's a fun direction to go. But I don't think either of those necessarily is the right approach for folks who work with software and hardware.

[08:00] For them, I think the right way to think about linear algebra is that linear algebra is more like programming. So matrices are our functions, shapes are the types of our data and our functions, and multiplication is function composition.

[08:14] This approach uses a bunch of ideas from programming, from computer science, rather than from, say, physics or from abstract math. So in programming, we combine functions with matching types through function and composition.

[08:28] We combine the functions together that way. In linear algebra, we combine matrices that have matching shapes through matrix multiplication. When we're doing programming, we define one function in terms of the other.

[08:40] So if I wanted to make a function that checks whether a string is too long to tweet, maybe tell the user, hey, this is too long. Shorten that string up if you want to be able to tweet it. I might define that by composing a couple of functions.

[08:53] First, I check the length of the string. Then I check whether the length of that string is over 140. So I combine these two functions together. That gives me too long to tweet as a function. That composition, if I were to look at the type of len,

[09:05] it takes, say, a string and returns an integer that's the length of the string over 140. takes an integer and returns a Boolean, a true or a false value. And so too long to tweet, take those two pieces, those arrows, and slap them onto each other.

[09:19] It now takes a string and returns a bool. That composition of those two functions takes the input of one to the output of the other in terms of types. And you can see that if we draw the typical way of representing functions in at least the American mathematical education system

[09:33] when you first learn about functions when you're in school, you see a picture that looks something like this. All the inputs vary in a little circle on one side. Then the function is like an arrow that points from one of those inputs to the output.

[09:47] So A and B are both strings of length 1. The beginning of the Declaration of Independence, when in the course of human events it becomes necessary, that's the strength of length 8007. 8007 is bigger than 140, and so the Declaration of Independence is too long to tweak.

[10:02] So you can read off how these functions behave by following those arrows. So we get this purple arrow representing 2 on to tweet by following these red and blue arrows representing len and over 140.

[10:15] And really this is the essence of programming If you ever do functional programming especially pure functional programming this is all that you do and it enough to do anything that you can do in any language We often do quite different things in other programming languages,

[10:29] but in principle, this is the core of what we are doing when we are programming computers, combining these things together. And in linear algebra, we do much the same thing. When we want to do function application, when we want to apply a function to data,

[10:43] that's matrix vector multiplication. Often you might think of matrix vector multiplication as an equational thing, as just a matrix vector on the left-hand side is equal to the thing on the right-hand side, the way you would say 2 plus 2 equals 4.

[10:56] But the view that I'm trying to put forward here is that you should instead think of this as M of V, M the matrix applied as a function to the vector V, is equal to the thing on the right-hand side, which for now we're just calling M V.

[11:09] Under that view, a matrix is a function that takes in arrays with a certain number of rows and puts out arrays with a certain number of rows. In this case, the number of rows in that vector, the sort of length of that vector, is just k here.

[11:24] I'm glad it could be a bunch of different things. But the output, the number of rows in the output, is given by the number of rows in that matrix. It's, in this case, 3. So like that function too long to tweak had a type signature that said it takes in strings and returns bools, this matrix has a type signature.

[11:42] It takes in arrays of length k1 and returns arrays of length 3.1. Let's just pick a particular matrix here, this matrix x here that's 1, 0, 0, 0, 1, 0.

[11:55] And let's view it the way we would view a function in that ovals and arrows way of viewing things. You can see the inputs on the left, the outputs on the right. So the outputs 0, 1, 2 and 0, 1, 5 both get taken to the same thing.

[12:09] And just as an exercise, if you were to name this the way you name a function in programming, what name would you give this matrix? X here. When we're doing math, we often think of these things as variables. We don't give them descriptive names.

[12:22] We give them names like X and W. But when we're thinking about functions, we often give them a specific name. So this X here is taking in inputs that look like things on the left and returning things that come out on the right here.

[12:37] When we combine matrices through matrix multiplication, when I take two matrices and multiply them together, that is like function composition. That is like that combination of two functions.

[12:50] So, for example, we could take that matrix X and then apply on the left matrix Y, and we would then get a different function, a different matrix that does something different to its input.

[13:03] As an example here, I've got this first matrix we already talked about that strips off that last entry. So it strips off the Z axis here. So then we get a vector that looks like this one in the middle here. So that red arrow here points from the inputs on the left to the outputs on the right.

[13:20] And then we can chain that with that second matrix, this matrix indicated in blue here, the Y matrix that takes two-dimensional inputs and returns two-dimensional outputs. And that one zeroes out the Y axis.

[13:32] So then we end up with a vector that lies just along the X axis with only the first entry of that vector that went in. And so this is, I think this way of drawing things is maybe a little bit more common for vectors than that oval way of drawing things.

[13:46] So I wanted to make sure we covered that. The combination of those two things, the combination of those two functions, is itself also a function, and is also, in fact, a matrix. And that matrix is obtained from the two submatrices by doing matrix multiplication.

[14:01] So that matrix multiplication rule, why did we use that specific matrix multiplication rule? It was so that this would be true. It was so that if we did matrix multiplication on the left of this matrix by this matrix,

[14:15] we would get this matrix representing their composition. So that is the reason for that matrix multiplication rule, fundamentally. This is the approach for thinking about linear algebra that I think is most fruitful for doing machine learning,

[14:27] and I think also for a lot of other branches that linear algebra gets applied in. But the remaining question is why linear algebra for machine learning? We've understood a little bit more about linear algebra, but we haven't really tackled why this is important for machine learning yet.

[14:43] I said that we use arrays to represent functions and represent data in machine learning, but why this? And what kinds of functions can we represent that way? So why, basically, is linear algebra so important for optimization and, hence, machine learning?

[14:58] I think the right tack to get to understanding this is to think about how would we represent functions as data? So the fundamental question, if we want to program by optimization, is that we need to be able to represent programs and manipulate them.

[15:14] So we're going to start with a program that doesn't work and then slowly over time obtain a program that does work. That's that process of optimization. In order to do that, we need to have that program and we need to be able to manipulate it somehow.

[15:26] The normal way that we represent programs as data is in strings, right? The usual way is build website.py, right? that is actually in the end a string of characters,

[15:38] and that's how humans tend to manipulate programs. In general, we could also represent our programs as dictionaries, as lookup tables, basically applying that oval and arrow view of functions to your program.

[15:51] So those are different ways to do it. Neither one is actually really good for optimization. If we want to optimize things quantitatively, actually the way that we represent programs is horrible, right? That's an absolutely horrible way to represent functions if we want to be able to optimize them.

[16:08] So just take an example, very simple Python program to check if a number is odd and return whether it's odd or even as a string. Pretty much any single character change to the code on the left would break it, would make it not do the right thing,

[16:22] or it would completely break the program, make it not valid Python anymore. It's a great way for representing programs in order to get programmers to work with it, for humans to adjust the behavior of programs.

[16:35] But for a computer, that's really difficult to be able to make these coordinated changes. And it's very difficult to make that numerical quantitative in a way that we can actually automate this process, which is the goal of machine learning.

[16:49] So the main way that we make progress is by restricting ourselves to a smaller set of functions and focusing on making those work well. And so looking for a more limited data type

[17:01] that maybe doesn't necessarily represent all possible functions, but is still powerful. If we focus on easy to optimize functions, we'll get two separate branches of machine learning, one based off of trees that gets you to things like random forests,

[17:15] gradient boosted trees, things like that, which we won't be talking about. And the other direction, which is towards using arrays. So using matrix multiplication as function application and thinking about only linear functions, at least at first.

[17:29] And the benefit is that arrays are really easy to change just a tiny bit. If I make a tiny change to the entries of a matrix, then I'll get a small change in the function that the matrix applies. So the outputs will only change a little bit if I change the matrix a little bit.

[17:44] And as we'll see when we talk about calculus, that's huge. That's extremely valuable. And then, of course, there's what we talked about, linear algebra being able to be made fast, and they're doing all these really useful things that you can do with it.

[17:56] And so at its core, this is why linear algebra is so important for machine learning. I want to go over a little bit what I mean by arrays represent linear functions, and now it's finally time to do a little bit of algebra.

[18:08] There's basically two rules for something to be a linear function. It needs to respect the first rule here. F of A plus B is F of A plus F of B. distribution over addition rule, and then it also needs to respect this distribution over multiplication rule,

[18:25] f of lambda times a is equal to lambda times f of a. So that says if I scale the inputs, then I just scale the outputs, and if I look at the output on the sum of two inputs, it's the sum of those outputs.

[18:38] So what this allows us to do is rearrange a problem that may come in looking like f of a plus f of b, which would require us to compute something twice, and swap that around just computing it once, f of a plus b.

[18:53] The reason why we have those two rules is because that means that linear functions play really nicely with weighted sums. If I combine those two rules, I can use them over and over again. If I multiply by a bunch of different scalars and if I add a whole bunch of things together,

[19:07] that's what this thing is doing here in the center, I can pull that out as well. Just as I could pull the plus sign out from the inside of the function application here and I can pull the scalar multiplication out to the outside of function application,

[19:22] I can do that with a whole bunch of sums and multiplications. So anytime that I do a weighted sum of things, my linear function will be able to either go on the inside or the outside. And that's huge.

[19:34] Weighted sums show up all over the place and basically makes linear functions really easy to reason about. So we'll see how this works with how linear functions interact with the number zero.

[19:46] So first off, 0 is going to always be sent to 0. So we get this from our multiplication rule. If I apply f to 0, that's the same as applying f to lambda times 0, because lambda times 0 is 0.

[20:00] And I can then pull that out, which means that lambda times f of 0 is equal to f of 0. That can only be true if f of 0 is 0. If inputs collide, we can also determine using these rules that more things have to be sent to zero.

[20:16] So if two things get sent to the same thing, then their difference has to be sent to zero. If f of a minus f of b is zero, so if a and b are sent to the same output by f, then f of a minus b is zero.

[20:31] So that difference between a and b, that thing gets sent to zero. So we can do that We just pulling that minus sign inside and if you being really technical we basically pulling a minus one inside and then using the addition rule Anything that is sent to zero is called the kernel of that function

[20:47] A minus B here is in the kernel of the function. So one thing that comes out of this is you can also check that result that we got, that zero has to be sent to zero, because if I set A and B equal to each other,

[20:59] then I get that same result. So zero is in the kernel. Zero is always sent to zero. other things might also be in the kernel. That kernel there is made up of weighted sums. Going back to this idea of weighted sums,

[21:13] if f of a is 0 and f of b is 0, then a plus b is also sent to 0. So f of a plus b, that's the same of f of a plus f of b, those are both 0, so the output is 0. And so this collection of things that get sent to 0

[21:26] is made up of weighted sums. If I take a whole bunch of things that are all sent to 0 and all I can do is combine them by weighted summing, then I'm still going to stay in this collection of things sent to zero. The thing is true of the collection of things not sent to zero.

[21:40] Anything not in the kernel, if I combine those things with weighted sums, then they'll stay outside of the kernel. I went through all this in order to get to this important concept of the rank of the function

[21:52] or the rank of the matrix. We can make new non-kernel elements, new things not sent to zero, by making weighted sums of known non-kernel elements. And rank answers the question, how many of these things do I need in order to make every possible thing that's not in the kernel?

[22:09] This collection of things comes to zero. Zero is not that interesting. So this is sort of telling us how many things do I need in order to understand all the interesting behavior of this matrix as a function or of this linear function.

[22:21] So that was just a little sort of quick exercise to give you a flavor for how reasoning works with linear functions and how much you can learn about linear functions with just a few quick linear algebraic manipulations.

[22:35] But I wanted to return to this idea of linear algebra as programming, linear algebra as less like those algebraic manipulations and more like the things we do when we program computers. The singular value decomposition is an important concept in linear algebra

[22:51] and in the understanding of matrices. And in my view, it is the equivalent of refactoring a program. but for linear algebra. Refactoring programs means taking a program and rewriting it,

[23:05] but not changing the way it behaves. You change the internals, maybe. You change just exactly what's going on under the hood, but you don't change anything about its overall behavior. The same inputs still go to the same outputs. We often refactor programs in order to understand

[23:20] them better. Sometimes we do it in order to make them run faster. There's lots of different reasons why we might refaster a program. And there's a couple of common tricks for doing that. One is to identify separation of concerns,

[23:33] to say, oh, this one piece of the program is doing too many things. I need to separate those out from each other. So in linear algebra, the equivalent of that is eigen decomposition. It says, okay, this matrix is working on an n-dimensional thing,

[23:45] but it's really actually doing n one-dimensional things, for example. And that's exactly what is done in eigen decomposition. He finds those n one-dimensional pieces and separate them out, that separation of concerns.

[23:58] We also remove code that doesn't do anything, right? You may identify, oh, this branch of code is effectively doing nothing. It's here, but it's dead weight. And that's very close to what we do when we do low-rank approximation.

[24:11] When we try and find a matrix that has much lower rank than the matrix, but still does about the same thing. So we'll talk about low-rank approximation in a little bit. The other piece that we often do when we refactor a program

[24:23] is we break something up into multiple functions. This is much like separation of concerns, but instead of maybe splitting it up in parallel, we're splitting it up in that we are taking one function

[24:35] that is doing several things in a row, and we are decomposing that into a bunch of pieces that then when composed together still do the same thing. So maybe I was loading data, then applying a function to data,

[24:50] then writing it to disk all in a single function. let's split that up into three functions that each do those three pieces. And then that's our final result. That's a refactor to break it up into those three pieces.

[25:04] And that is the equivalent of the singular value decomposition. So in linear algebra, we have the singular value decomposition, decomposing, breaking up into pieces, undoing a composition, and it achieves the same thing as that particular type of refactoring in programming.

[25:19] So in general, in fact, any function can be decomposed the way we do in the singular value decomposition, which I think is very cool, which is you basically break your function up into three pieces. So I've got this diagram here.

[25:31] We're going to go through an example with that isOdd function. So we have this single isOdd function here that just did it all in a couple of lines and goes straight from integers to strings. But that function can actually be broken down into pieces.

[25:43] And this can be useful. Refactoring can be useful in case maybe sometime later, something says, oh, we actually wanted to not return strings, but to return just the Booleans instead. Or we actually wanted to use this mod 3 rather than mod 2.

[25:57] And so breaking it up into pieces can make it easier to make those changes later. If we want to break this function down into pieces, there's a sort of natural way to break it down, in my view, which is to first do the part where we determine whether this number

[26:09] is even or odd with this mod 2 operation. Then we apply 2Bool. We take that integer and we turn it into a true or false value. And then we do something different for true and for false.

[26:21] So for true, we return odd. And for false, we return even. So that's what this breakdown achieves here. This split up into the mod2, tubule, two string setup. And this actually, this represents a very generic strategy.

[26:35] So one way of thinking about this is that mod2 picks representatives for this function. There are really only two outputs here. There's only odd and there's even.

[26:47] And so really, we only need to know what this function does on two inputs, and then know which inputs correspond to those two classes. There's going to be one representative for odd, one representative for even.

[27:01] And in this first step here, we're going to say, okay, do you get sent to the same value as zero, or do you get sent to the same value as one? That sort of simplifies this function down after this point When it comes to mapping those inputs to the final output, we only need to do it for those two specific representatives.

[27:16] And it makes this first step just a matter of saying, okay, these guys are all going to get treated the same. These guys are all going to get treated the same. So that's our first step here, picking out representatives. Then we have a step that's just a reversible renaming step.

[27:29] So associate each representative with this output, one to one. So zero is going to be even. It is false if this is odd. So that gets mapped to false. One gets mapped to true. And the important thing here is that each output is targeted by one and only one representative.

[27:45] So there's no mixing, no combination, no splitting, just one-to-one, a renaming operation. And then finally, we just need to put them into the correct type. We got a true and a false here, but what we wanted was actually a string at the end,

[27:59] because this is going out to a user who wants to know whether this is odd or even. We're effectively just recognizing a true and false are the outputs here, and we just need to map them onto what the string is that we want to show the user. So we're finding essentially a copy of the output of 2 bool, of true and false, inside of the type string.

[28:16] And we can apply that same breakdown to a matrix. That breakdown there is generic for any function. It's called the canonical decomposition. And we can apply that canonical decomposition onto a matrix.

[28:28] The equational way of writing is to say that m equals abc. So a times b times c. The sort of function view is to say that the function M that maps arrays of dimension N to arrays of dimension M,

[28:44] that function can be broken down into the composition of three functions, C, then B, then A. So this diagram here says the same thing as this equation over here.

[28:56] And in this setup, C is a wide matrix, B is a square matrix, and A is a tall matrix. So C has at least as many columns as it has rows.

[29:08] B has exactly as many columns as it has rows. And A has at least as many rows as it has columns. Typical case to think in mind here is where C is much wider than it is tall and A is much taller than it is wide.

[29:20] C here is going to have R rows and columns. N is the number of inputs, so it needs to have that many columns in order to be able to take the same inputs as M.

[29:32] but it's going to map it down to only R output. So it's going to shrink it down. And what this is doing is it's throwing out that kernel. So if A and B are sent to the same output by M, so we're picking out representatives in this step.

[29:46] We're saying any things that get sent to the same thing, we want to map those all to one value and only worry about one value. Like how we did that mod2 operation for isOdd to say, okay, I know what I want to do with 0 and 1.

[29:58] Let me map everything that is the same as 0 to 0, everything's the same as one to one. So here we're doing that same thing and now we're looking for two things that get sent to the same output by m. And so that connects us back to that kernel, right? And

[30:12] to that idea of rank. How many things do I need in order to build all possible outputs? And so that's why this output here has dimension r. It has the same size as that set that we need to build all

[30:25] non-kernel elements. But what effectively we're doing here at a high level you can think of it as throwing out everything that gets sent to zero. Then that second step here with B is our reversible relabeling. That's why this matrix is square. It takes in our inputs and returns our outputs. So

[30:41] it's just a relabeling. It says, okay, I'm going to maybe like slightly change around my axes, but I'm not really making any serious changes. And it's always reversible. It's square, so the inputs are the same size as the outputs. That means it's possible for it to be reversible. If the outputs

[30:56] are way bigger or way smaller than the inputs then it not always going to be possible to reverse that operation And so those matrices A and C are in general not reversible So this is our special reversible step

[31:08] And then lastly, we have a tall matrix at the end, A, whose outputs can be bigger than its inputs, and it effectively finds a copy of all rays of R elements among the set of all M element arrays. For example, if you can see my video here,

[31:21] if I wanted to find a copy of two-dimensional arrays among the set of all three-dimensional arrays, there are lots of possible examples. My fingers here represent two axes, and I'm moving them around in 3D space

[31:33] to represent a whole bunch of different potential copies of the set of all two-dimensional arrays among the set of all three-dimensional arrays, the three-dimensional space, like the one we live in. So this step here says, okay,

[31:45] these are dimensional arrays that came out of B. I want to match that to the outputs of M on its inputs, and so those outputs have dimension M. I need to sort of inject these R-dimensional arrays into that output space. This is very similar to choosing to say, okay, true corresponds to odd

[32:01] and false corresponds to even in our is odd example. There's many possible choices of two words out of the set of all possible strings, and we chose those in particular because that's what our function was supposed to output. So that gives us this decomposition into three pieces.

[32:17] One way to think about it is, if you're familiar with these ideas, the first function is the onto piece. It's the surjection piece of the function M. The part in the middle is the bijective piece

[32:31] of the function, or the isomorphism piece of the function, the reversible piece of the function. And then A is the injective piece of the function, the part that is into, one-to-one. So we're

[32:43] breaking our function down into those three types of pieces. The function M need not have any of those properties, but it has pieces that each have those properties of injectivity, bijectivity, and surjectivity. So that's this way of breaking down matrices in general. If we make

[33:01] certain special choices, then we get this singular value decomposition. If we aim to make that middle matrix diagonal, so it only has numbers in the rows and columns that are equal to each other,

[33:13] row one, column one, row two, column two, that's represented by this dashed line here, then we end up with this singular value decomposition, where U and Z are unitary matrices.

[33:25] They don't grow or shrink anything. In addition to throwing stuff out, all they do is change basis, change the axes, spin them around or reflect them. That's what these two, the two matrices, C and A, become these unitary matrices U and Z.

[33:39] And then that is the singular value decomposition. But the important thing about the theory of value decomposition, in my view, is that it breaks down those three pieces of surjection, bijection, and injection.

[33:51] This SVD is a very special way to break up a matrix, as may be indicated by its relationship to this canonical decomposition, this canonical way of breaking down matrices. And one reason why is it can be used to calculate low-rank approximations.

[34:05] If you've done principal components analysis, linear algebra-based technique for dimensionality reduction, pre-processing and analyzing data, that's based on the singular value decomposition. In a pure math setting, it's used to classify and measure matrices in linear algebra to determine the norm and size of matrices, to measure the inner product of matrices.

[34:24] It's all based on the singular value decomposition. But we're going to talk about how it's used in low-rank approximations, since this comes up in numerical linear algebra and in machine learning. when the rank is not close to their total dimension, they are said to have low rank.

[34:40] What that looks like if you actually look at a matrix with low rank, often, but not always, that shows up as a very obvious visual pattern that the matrix has some simple pattern to it. So just as an example, taking this video here from a security camera and turning it into a matrix

[34:57] where each column here represents a single frame of the security video, then you can see that it's basically constant all the time. There's these little deviations from that pattern inside the matrix.

[35:09] So basically, there's a simple pattern of just repeat the same column over and over again, representing effectively the fixed background of the scene. And I just want to compare that to what a full-ranked matrix would look like.

[35:21] So as an example, here's what a full-ranked matrix of the same size would look like. It's sort of more like what you would get by looking at the static coming out of a TV from an old-school TV between channels.

[35:33] Again, we're taking each frame here and making it a column, and then we're looking over time for this axis here. Each column here is one frame of this video, so the same as in this previous example with the security camera footage.

[35:46] But now, since it's static, there isn't this simple pattern. There isn't this simple, oh, the background stays the same all the time. A full-ranked matrix doesn't have those kinds of simple patterns in it and can't be broken down in the same way. So lots of matrices that we come across in real life when we measure data, like pointing the security camera somewhere and measuring its inputs, we end up with something that is nearly low rank.

[36:09] It's not exactly low rank. It can't be exactly represented by a simple pattern, but it's nearly low rank. So the simplest low rank pattern in the video is just this background. So if I just take this vector representing a single frame here and then repeat it over and over again,

[36:25] then I'll get this matrix over here, which is very close to the original matrix. So one thing to note here, this matrix multiplication here is kind of like a repeat or a tile function, right? If you were to implement a function that did this, you might call it repeat or tile.

[36:40] So that's what this all-ones vector is doing here. So another example of how we can think of matrices more like we think of functions in computer programs. And this is effectively an approximation to that original input. This guy here is very close to this guy on the right here.

[36:55] And this guy, it is rank one. One way to see that it is low rank is that it can be represented by this really tiny thing here. So the reason why this is useful is because we can use it to compress data. So we can take something that looks like this and turn it into just this thing here on the right hand side.

[37:13] So that would really compress our video down quite a bit. Of course, it would also remove all the things that are moving in the video, which is maybe too much compression. But the general principle of low-rank approximation is what JPEG is based off of.

[37:25] So JPEG basically does low-rank approximation to pictures using the Fourier transform. But I think what's more useful in this case, and this idea comes from FASC.AI's numerical linear algebra course, which is a great resource, is to actually do foreground-background separation.

[37:40] So if we take this original video frame and that rank one approximation for that frame looks like this, then if I subtract that off, then I get only that foreground. That small bit there that was deviating from this background pattern is of interest.

[37:56] So even when that approximation, that compression is not a good compression for our purposes, it can still be useful. And so these low rank approximations show up in both ways when we're doing data science and machine learning.

[38:09] Sometimes it's done to do compression. Sometimes it's done to pull out the interesting uncompressible pieces. So that's our three takeaways for today, that linear algebra is important. This idea of matrix multiplication, despite its relative simplicity,

[38:23] is actually very powerful, shows up in lots of different places. Linear algebra, despite the way it's normally taught, is not actually that much like algebra. It's more like computer programming in a lot of ways, where shapes are our types,

[38:35] matrices are our functions, and matrix multiplication is function composition. And then finally, the singular value decomposition is our matrix refactoring. When we want to refactor programs, common operation, something we need to do all the time

[38:51] when we're doing computer programming, refactoring shows up in the form of this singular value decomposition, among other ways that we might refactor matrices. Some more resources for understanding linear algebra. Effectively, we aren't really going to be able to cover all the mathematics you need for machine learning in just three sessions.

[39:08] And so I wanted to give you some pointers to more resources. And what is most useful to do next is very dependent on what your past history is with this branch of math and what your future plans are.

[39:20] So if you had a traumatic linear algebra class back in the day, maybe in college, and it was just maybe not your favorite class in college, and you felt like you didn't learn that much, then you should try Essence of Linear Algebra

[39:33] by 3Blue1Brown on YouTube. It's bingeable. You could fit it in an afternoon if you really wanted to with these really nice, slick animations. A lot of people say it's like a sort of religious experience of,

[39:45] oh, I finally see the light of linear algebra, like why it was so confusing for so long. So highly recommended. If you want to level up your abstract math jobs, if you really want to get a better handle on how to think about mathematics,

[39:58] read and understand mathematics, maybe for being able to read machine learning papers a little bit better, then that graphical linear algebra blog post series by Pavel Soboczynski is a great place to start. It's aimed at people who have not taken a proof-based math class before and to teach them

[40:12] how to do that, while also teaching them some cutting-edge mathematics, which is a great combination. It's well-written. If you're more of the sort of hacker type who really wants to get your hands dirty with some actual code, then I would try numerical linear algebra by fast.ai

[40:27] with Rachel Thomas. So it's an online class and textbook focusing on applications of linear algebra for machine learning. And if you're the type who prefers the traditional math class experience, then I would check out Linear Algebra Done Right by Sheldon Axler. The textbook has been

[40:43] free in the past. I'm not sure if it's currently free, but the lecture series on YouTube is certainly free. So it's Linear Algebra Done Right Lecture Series by Sheldon Axler goes through a very classic traditional math class approach to linear algebra, so if you wanted to do that,

[40:59] that would be the place to do it, I would think.

⚡ Saved you 0h 41m reading this? Transcribe any YouTube video for free — no signup needed.