Title: MIT Introduction to Deep Learning | 6.S191

Channel: Alexander Amini

Submission date: 2024-04-29

Introduction

00:00:00 - 00:07:25 Link to video

Good afternoon, everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. Together, we're really excited to welcome you to this incredible course.

This is a very fast-paced and intense one-week course that we're about to go through together. We're going to cover the foundations of a very fast-moving field that has been rapidly changing over the past eight years that we have taught this course at MIT. Over the past decade, even before we started teaching this course, AI and deep learning have revolutionized many different areas of science, mathematics, physics, and so on.

Not that long ago, we faced challenges and problems that we did not think were necessarily solvable in our lifetimes. AI is now actually solving these problems beyond human performance. Each year that we teach this course, this lecture in particular gets harder and harder to teach. For an introductory-level course, this lecture is supposed to cover the foundations. If you think of any other introductory course, like a 101 course in mathematics or biology, those first lectures don't really change much over time. But we're in a rapidly changing field of AI and deep learning, where even these foundational lectures are changing rapidly.

Let me give you an example of how we introduced this course only a few years ago:

"Hi everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT. Deep learning is revolutionizing so many fields, from robotics to medicine and everything in between. You'll learn the fundamentals of this field and how you can build these incredible algorithms. In fact, this entire speech and video are not real and were created using deep learning and artificial intelligence. In this class, you'll learn how. It has been an honor to speak with you today, and I hope you enjoy the course."

The really surprising thing about that video to me, when we first made it, was how viral it went. Within just a couple of months of us teaching this course, it got over a million views. People were shocked by a few things, but the main one was the realism: the ability of AI to generate content that looks and sounds extremely lifelike.

When we created this video for the class only a few years ago, it took us about $10,000 in compute to generate just a minute-long video. If you think about it, that's extremely expensive for that amount of content. Maybe a lot of you are not even impressed by it today, because you see all the amazing things that AI and deep learning are producing now.

Fast forward to today, the progress in deep learning is astounding. People were making all kinds of exciting remarks about it when it came out a few years ago. Now, this is common stuff because AI is doing much more powerful things than this fun little introductory video.

Today, deep learning has become so commoditized that it's at our fingertips: online, in our smartphones, and so on. In fact, we can use deep learning to generate these types of hyperrealistic media and content entirely from English language, without even coding anymore. Before, we had to actually go in, train these models, and code them ourselves to create that one-minute-long video. Today, we have models that will do that for us end-to-end, directly from English language.

We can use these models to create something that the world has never seen before, like a photo of an astronaut riding a horse. These models can imagine those pieces of content entirely from scratch.

My personal favorite is how we can now ask these deep learning models, which are themselves software, to create new types of software. For example, we can ask them to write a piece of TensorFlow code to train a neural network: we're asking a neural network to write TensorFlow code to train another neural network. Our model can produce examples of functional, usable code that satisfies the English prompt while walking through each part of the code independently. Not only does it produce the code, but it also educates and teaches the user on what each of these code blocks is actually doing. You can see an example here.

What I'm trying to show you with all of this is just how far deep learning has come, even in a couple of years since we started teaching this course. Going back even from before that to eight years ago, the most amazing thing you'll see in this course, in my opinion, is that we try to teach you the foundations of all of this. We show you how all of these different types of models are created from the ground up and how we can make all of these amazing advances possible so that you can also do it on your own.

As I mentioned in the beginning, this introductory course is getting harder and harder to make every year. I don't know where the field is going to be next year, or even in one or two months' time from now, just because it's moving so incredibly fast. But what I do know is that what we will share with you in the course as part of this one week is going to be the foundations of all the technologies that we have seen up until this point. This will allow you to create that future for yourselves and to design brand new types of deep learning models using those fundamentals and foundations.

So let's get started with all of that and start to figure out how we can actually achieve all of these different pieces and learn all of these different components. We should start this by really tackling the foundations.

Course information

00:07:25 - 00:13:37 Link to video

From the very beginning, we need to ask ourselves: what is deep learning? I think all of you, before coming to this class today, have heard the term "deep learning." However, it's important to understand how this concept relates to other pieces of science you've learned about so far. To do that, we have to start from the very beginning and think about what intelligence is at its core—not even artificial intelligence, but just intelligence.

The way I like to think about this is that intelligence is the ability to process information, which will inform your future decision-making abilities. This is something we, as humans, do every single day. Artificial intelligence is simply the ability for us to give computers that same ability to process information and inform future decisions.

Machine learning is a subset of artificial intelligence. You should think of machine learning as the science of teaching computers how to process information and make decisions from data. Instead of hardcoding rules into machines and programming them like we used to do in software engineering classes, we now try to enable machines to process information and inform future decision-making abilities directly from data.

Going one step deeper, deep learning is a subset of machine learning that uses neural networks to process raw, unprocessed data. These neural networks can ingest large datasets and inform future decisions. This class is all about teaching machines how to process data, process information, and inform decision-making abilities from that data.

This program is split into two different parts: technical lectures and software labs. For example, this lecture is one part of the technical lectures. We will have several new updates this year, covering the rapidly changing advances in AI. Especially in some of the later lectures, you will see these updates. The first lecture today will cover the foundations of neural networks, starting with the building blocks of every neural network, called the perceptron. We will conclude the week with a series of exciting guest lectures from industry-leading sponsors of the course.

On the software side, after every lecture, you'll get software experience and project-building experience. This will allow you to take what we teach in lectures and deploy it in real code, producing results based on the learnings from the lecture. At the very end of the class, you'll have the opportunity to participate in a fun project pitch competition, similar to a Shark Tank-style competition, where you can win some awesome prizes.

Each day, we will have dedicated software labs that mirror the technical lectures, helping you reinforce your learnings. These labs are coupled with prizes for the top-performing software solutions. Today, we start with Lab 1 on music generation. You'll learn how to build a neural network that can learn from a bunch of musical songs, listen to them, and then compose brand new songs in the same genre.

Tomorrow, Lab 2 will focus on computer vision. You'll learn about facial detection systems and build one from scratch using convolutional neural networks. You'll also learn how to remove biases that exist in some of these facial detection systems, which is a significant problem for state-of-the-art solutions today. Finally, a brand new lab at the end of the course will focus on large language models. You'll take a multi-billion parameter large language model, fine-tune it to build an assistive chatbot, and evaluate a set of cognitive abilities ranging from mathematics to scientific reasoning to logical abilities.

At the very end, there will be a final project pitch competition, with up to 5 minutes per team. All of these are accompanied by great prizes, so there will be a lot of fun throughout the week. There are many resources to help with this class, and you'll see them posted here. You don't need to write them down because all the slides are already posted online. Please post to Piazza if you have any questions. We have an amazing team helping teach this course this year, and you can reach out to any of us if you have questions. Piazza is a great place to start.

Myself and Ava will be the two main lecturers for this course, especially Monday through Wednesday. We will also have some amazing guest lectures in the second half of the course, which you definitely want to attend because they cover the state-of-the-art sides of deep learning in the industry outside of academia.

Very briefly, I want to give a huge thanks to all of our sponsors. Without their support, this course, like every year, would not be possible.

Okay, so now let's start with...

Why deep learning?

00:13:37 - 00:17:20 Link to video

The fun stuff and my favorite part of the course are the technical parts. Let's start by just asking ourselves a question: why do we care about all of this? Why do we care about deep learning? Why did you all come here today to learn and to listen to this course?

To understand, I think we need to go back a bit and look at how machine learning used to be performed. Traditional machine learning would typically define a set of features: you can think of these as a set of things to look for in an image or in a piece of data. Usually these were hand-engineered, meaning humans had to define them themselves. The problem is that hand-engineered features tend to be very brittle in practice, just by the nature of a human defining them.

The key idea of deep learning, and what you're going to learn throughout this entire week, is this paradigm shift of trying to move away from hand-engineering features and rules that computers should look for, and instead trying to learn them directly from raw pieces of data. What are the patterns that we need to look at in datasets such that if we look at those patterns, we can make some interesting decisions and interesting actions can come out?

For example, if we wanted to learn how to detect faces, think about how you would detect faces. If you look at a picture, what are you looking for to detect a face? You're looking for some particular patterns: eyes, noses, and ears. When those things are all composed in a certain way, you would probably deduce that that's a face, right? Computers do something very similar. They have to understand what patterns they look for, what the eyes, noses, and ears of those pieces of data are, and then from there, actually detect and predict from them.

The really interesting thing about deep learning is that these foundations for doing exactly what I just mentioned—picking out the building blocks, picking out the features from raw pieces of data, and the underlying algorithms themselves—have existed for many decades. The question I would ask at this point is: why are we studying this now, and why is all of this really blowing up right now with so many great advances?

There are three main reasons. Number one is that the data available to us today is significantly more pervasive. These models are hungry for data (you're going to learn about this in more detail), and we're living in a world right now where data is more abundant than it has ever been in our history.

Secondly, these algorithms are massively compute-hungry and massively parallelizable, which means that they have greatly benefited from compute hardware that is itself parallel. That hardware is the GPU. GPUs can run parallel processing streams of information and are particularly amenable to deep learning algorithms. The abundance of GPUs and of that compute hardware has also pushed forward what we can do in deep learning.

Finally, the last piece is the software. It's the open-source tools that are used as the foundational building blocks of deploying and building all of these underlying models that you're going to learn about in this course. These open-source tools have become extremely streamlined, making it extremely easy for all of us to learn about these technologies within an amazing one-week course like this.

The perceptron

00:17:20 - 00:24:30 Link to video

So let's start now with understanding, now that we have some of the background, exactly what is the fundamental building block of a neural network. That building block is called a perceptron. Every single neural network is built up of multiple perceptrons, and you're going to learn how those perceptrons compute information themselves and how they connect to these much larger billion-parameter neural networks.

The key idea of a perceptron, or even more simply, a single neuron, is actually extremely simple. A neural network is composed of many neurons, and a perceptron is just one neuron. I hope that by the end of today, the way a perceptron processes information becomes extremely clear to you.

Let's start by talking about the forward propagation of information through a single neuron. A single neuron ingests information; it can actually ingest multiple pieces of information. Here you can see this neuron taking m inputs: X1, X2, and so on through XM. We define this set of inputs X1 through XM, and each of these inputs is element-wise multiplied by a particular weight, denoted here by W1 through WM. There is a corresponding weight for every single input, and the weights are part of the neuron itself.

Now, you multiply each input by its weight and add the results up. We take the single number that comes out of that addition and pass it through what's called a nonlinear activation function to produce the final output, which we call y.

What I just said is not entirely correct; I left out one critical piece of information: the bias term. The bias term allows your neuron to shift its activation function horizontally on the x-axis. On the right side, you can now see a diagram illustrating, as one single equation, what I just talked through conceptually, and we can rewrite this using linear algebra: vectors and dot products.

So let's do that. Our inputs are going to be described by a capital X, which is simply a vector of all of our inputs X1 through XM. Our weights are going to be described by a capital W, which is W1 through WM. The weighted sum is obtained by taking the dot product of X and W: the dot product does the element-wise multiplication and then sums the results. The missing piece is that we then add the bias term, here called w0. Finally, we apply the nonlinearity, denoted as G, so the output is y = G(w0 + X · W).
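
As a rough sketch in NumPy (the numbers here are invented, not from the lecture), that whole forward pass is only a few lines:

```python
import numpy as np

def sigmoid(z):
    # One choice of nonlinearity g: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])    # inputs x1..xm (made-up values)
w = np.array([0.5, -1.0, 0.25])  # one weight per input, w1..wm (made-up values)
w0 = 1.0                         # bias term

z = w0 + np.dot(x, w)            # dot product, then add the bias
y = sigmoid(z)                   # apply the nonlinearity g to get the output
```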

I've mentioned this nonlinearity a few times, this activation function. Let's dig into it a little bit more to understand what this activation function is doing. I said it's a nonlinear function. Here you can see one example of an activation function. One commonly used activation function is called the sigmoid function, which you can see here on the bottom right-hand side of the screen. The sigmoid function is very commonly used because it takes as input any real number and squashes every input X into a number between 0 and 1. It's a common choice for things like probability distributions if you want to convert your answers into probabilities or teach a neuron to learn a probability distribution.

In fact, there are many different types of nonlinear activation functions used in neural networks. Here are some common ones. Throughout this presentation, you'll see these little TensorFlow icons, which relate some of the foundational knowledge we're teaching in the lectures to some of the software labs. This might provide a good starting point for the software parts of the class.

The sigmoid activation, which we talked about in the last slide, is shown on the left-hand side. It's popular because it squashes everything between 0 and 1, which makes it useful for probability distributions. You can see two other very common activation functions in the middle and on the right-hand side as well. The other very common one, probably the most popular now, is the ReLU activation function, or rectified linear unit. It's linear everywhere except at x equals 0, where there's a kink: the function itself is continuous, but its slope changes abruptly. The benefit of this is that it's very easy to compute, still has the nonlinearity we need, and is very fast; it's just two linear functions combined piecewise.
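
As a quick sketch of what these look like as TensorFlow calls (the input values here are arbitrary):

```python
import tensorflow as tf

z = tf.constant([-2.0, 0.0, 2.0])  # arbitrary pre-activation values

tf.math.sigmoid(z)  # squashes into (0, 1); useful for probabilities
tf.math.tanh(z)     # squashes into (-1, 1)
tf.nn.relu(z)       # max(0, z): cheap to compute, with a kink at z = 0
```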

Now let's talk about why we need a nonlinearity in the first place. Why not just pass all of these inputs through a linear function? The point of the activation function is to introduce nonlinearities. We want our neural network to handle nonlinear data, because the world is extremely nonlinear, and real-world data sets are nonlinear. If you look at a data set like this one with green and red points, and I ask you to build a neural network that separates the green and red points, we need a nonlinear function to do that; we cannot solve this problem with a single line. If we used linear functions as our activation function, then no matter how big our neural network is, it's still a linear function, because compositions of linear functions are still linear. So no matter how deep your network is or how many parameters it has, the best it could do to separate these green and red points would still be a single straight line.

Adding nonlinearities allows our neural networks to be smaller by allowing them to be more expressive and capture more complexities in the data sets. This makes them much more powerful in the end.

Perceptron example

00:24:30 - 00:37:51 Link to video

Let's understand this with a simple example. Imagine I give you a trained neural network. What does it mean to have a trained neural network? It means I am giving you the weights, right? Not only the inputs, but I am also telling you what the weights of this neural network are.

So here, let's say the bias term w0 is going to be 1, and our weight vector W is going to be [3, -2]. These are just the weights of your trained neural network; let's worry about how we got those weights in a second. This network has two inputs, x1 and x2.

Now, if we want to get the output of this neural network, all we have to do is the same story that we talked about before. It's the dot product of inputs with weights, adding the bias, and applying the nonlinearity. Those are the three components that you really have to remember as part of this class: dot product, add the bias, and apply a nonlinearity. This process keeps repeating over and over again for every single neuron. After that happens, that neuron is going to output a single number.

Now, let's take a look at what's inside that nonlinearity. It's simply a weighted combination of the inputs with the weights. If we look at what's inside of G, it's a weighted combination of X and W, plus the bias, and that produces a single number. Because this model has two inputs, x1 and x2, setting that quantity to zero defines a line in a two-dimensional space. We can actually plot that line and see exactly how this neuron separates points on the axes x1 and x2. We can visualize and interpret exactly what this neuron is doing: we can plot the line that defines this neuron.

Here, we're plotting where that quantity equals zero. If I give the neuron a new data point, say x1 = -1 and x2 = 2, just an arbitrary point in this two-dimensional space, we can plot that point. Which side of the line it falls on tells us the sign of the quantity inside the nonlinearity, and from there we can compute the answer itself. If we follow the equation written at the top and plug in -1 and 2, we get 1 - 3 - 4, which equals -6. When I put that into my nonlinearity G, the sigmoid, I get a final output of roughly 0.002.

The important point to remember here is that the sigmoid function divides the space into two parts. It squashes everything between 0 and 1, but it implicitly divides its outputs into those less than 0.5 and those greater than 0.5, depending on whether its input is less than or greater than zero. If your point falls on one side of the line, the weighted sum is negative and the output will be less than 0.5; if it falls on the other side, the output will be greater than 0.5.
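
A quick sketch that reproduces this worked example numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w0, w = 1.0, np.array([3.0, -2.0])  # the trained weights from the example
x = np.array([-1.0, 2.0])           # the new data point (x1, x2)

z = w0 + np.dot(x, w)               # 1 + 3*(-1) + (-2)*2 = -6
y = sigmoid(z)                      # ~0.002, well below 0.5: the negative side of the line
```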

We can visualize this space, called the feature space of a neural network, in its entirety, and understand exactly what the neuron will do for any input it sees. But of course, this is a very simple case: it's not a neural network, just one neuron with only two inputs. In reality, the neural networks you'll be dealing with in this course will have millions or even billions of parameters. Here, we only have two weights, W1 and W2, but today's neural networks have billions of these parameters, so drawing these kinds of plots becomes much more challenging and, in practice, impossible.

Now that we have some intuition behind a perceptron, let's start building neural networks and see how all of this comes together. Let's revisit that previous diagram of a perceptron. If there's only one thing to take away from this lecture, it's to remember how a perceptron works. The equation of a perceptron is extremely important for every single class that comes after today. There are only three steps: dot product with the inputs, add a bias, and apply your nonlinearity.

Let's simplify the diagram a little bit. I'll remove the weight labels from this picture; now, you can assume that every line has an associated weight that comes with it. I'll also remove the bias term for simplicity, and assume that every neuron has that bias term. The result, which we now call Z, is just the dot product plus bias, before the nonlinearity: it's linear, a weighted sum of all those pieces. We have not applied the nonlinearity yet, but our final output is going to be G(Z), the nonlinear activation function applied to Z.

Now, if we want to step this up a little bit more and say, "What if we had a multi-output function?" Now, we don't just have one output, but let's say we want to have two outputs. We can just have two neurons in this network. Every neuron sees all of the inputs that came before it. The top neuron will predict an answer, and the bottom neuron will predict its own answer. Importantly, each neuron has its own weights. Each neuron has its own lines that are coming into just that neuron. They act independently but can later communicate if you have another layer.

Let's now formalize this process a bit further and think about it more programmatically. What if we wanted to program this neural network ourselves from scratch? Remember that equation I told you; it didn't sound very complex: take a dot product, add a bias, and apply a nonlinearity. Let's see how we would actually implement something like that.

To define the layer, which is a collection of neurons, we have to first define how that information propagates through the network. We can do that by creating a call function. First, we're going to define the weights for that network. Remember, every neuron has weights and a bias. Let's define those first. We're going to create the call function to see how we can pass information through that layer. This is going to take inputs, like what we previously called ( X ). It's the same story we've been seeing this whole class: matrix multiply or take a dot product of our inputs with our weights, add a bias, and then apply a nonlinearity. It's really that simple. We've now created a single-layer neural network.
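
A rough sketch of such a from-scratch layer, in the spirit of what the labs walk through (the class name and initializer choices here are ours, not necessarily the lab's exact code):

```python
import tensorflow as tf

class MyDenseLayer(tf.keras.layers.Layer):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # One weight per input-to-neuron connection, plus one bias per neuron.
        self.W = self.add_weight(shape=(input_dim, output_dim),
                                 initializer="random_normal")
        self.b = self.add_weight(shape=(1, output_dim),
                                 initializer="zeros")

    def call(self, inputs):
        z = tf.matmul(inputs, self.W) + self.b  # dot product, then add the bias
        return tf.math.sigmoid(z)               # apply the nonlinearity

# Example use: a layer with 3 inputs and 2 neurons.
layer = MyDenseLayer(input_dim=3, output_dim=2)
out = layer(tf.constant([[1.0, 2.0, 3.0]]))
```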

The nonlinearity line in particular is what makes this a powerful neural network. The important thing to note is that modern deep learning toolboxes and libraries already implement a lot of this for you. It's important to understand the foundations, but in practice, all of that layer architecture and logic is implemented in tools like TensorFlow and PyTorch through a dense layer. Here, you can see an example of creating and initializing a dense layer with two neurons and feeding it an arbitrary set of inputs; in this case, the two neurons in the layer are fed three inputs. In code, it's reduced to one line of TensorFlow, making these layers extremely easy and convenient to use.
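
Something along these lines (the exact arguments may differ from the slide):

```python
import tensorflow as tf

layer = tf.keras.layers.Dense(units=2)  # a layer of 2 neurons; the input size is inferred
```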

Now, let's look at a neural network with a single hidden layer: one layer between our input and our outputs. We're slowly and progressively increasing the complexity of our neural network from these building blocks. The layer in the middle is called a hidden layer because you don't directly observe it: you observe the input and output layers, but the hidden layer is a layer of neurons that you don't directly observe. It gives your network more capacity and learning complexity. Since we now have a transformation from inputs to hidden layer and from hidden layer to output, we have a two-layered neural network. This means we also have two weight matrices: not just the W1 we previously had to create this hidden layer, but also a W2 that does the transformation from hidden layer to output layer.

Yes? What happens to the nonlinearity in the hidden layer? If you have just a linear function there, is it still a perceptron or not?

Yes, every hidden layer also has a nonlinearity accompanying it. That's a very important point, because if you don't have that nonlinearity, the whole thing is just one very large linear function followed by a single nonlinearity at the very end. You need that cascading, overlapping application of nonlinearities throughout the network.

Awesome. Now, let's zoom in and look at a single unit in the hidden layer. Take this one, for example; let's call it Z2. It's the second neuron in the first layer. It's the same perceptron that we saw before: we compute its answer by taking a dot product of its weights with its inputs, adding a bias, and then applying a nonlinearity. If we took a different hidden node, like Z3, the one right below it, we would compute its answer exactly the same way we computed Z2, except its weights would be different from those of Z2. Everything else stays exactly the same; it sees the same inputs. Of course, I'm not going to actually show Z3 in this picture.

Now, this picture is getting a little bit messy, so let's clean things up a little bit more. I'm going to remove all the lines and replace them with these boxes, these symbols that will denote what we call a fully connected layer. These layers now denote that everything in our input is connected to everything in our output. The transformation is exactly as we saw before: dot product, bias, and nonlinearity.

In code, this is extremely straightforward with the foundations we've built up from the beginning of the class. We can just define two of these dense layers: the hidden layer, with n hidden units, and then the output layer, with two output units.

Does that mean the nonlinearity function must be the same between layers?

The nonlinearity function does not need to be the same in every layer. Often it is, for convenience, but there are some cases where you would want it to be different. In lecture two especially, you're going to see nonlinearities differ even within the same layer, let alone across layers. But unless you have a particular reason, the general convention is to keep them the same.

Now, let's keep expanding our knowledge a little bit more. If we now want to make a deep neural network, not just a neural network like we saw in the previous slide, now it's deep. All that means is that we're now going to stack these layers on top of each other, one by one, creating a hierarchical model. The final output is now going to be computed by going deeper and deeper into the neural network. Again, doing this in code follows the exact same story as before, just cascading these TensorFlow layers on top of each other and going deeper into the network.
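
A sketch of what that stacking looks like in code (the layer sizes here are arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),  # hidden layer 1
    tf.keras.layers.Dense(64, activation="relu"),  # hidden layer 2, going deeper
    tf.keras.layers.Dense(2),                      # output layer with 2 units
])
```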

Applying neural networks

00:37:51 - 00:41:12 Link to video

Okay, so now this is great because we have at least a solid foundational understanding of how to not only define a single neuron but also how to define an entire neural network. You should be able to explain or understand how information goes from input through an entire neural network to compute an output.

So now, let's look at how we can apply these neural networks to solve a very real problem that I'm sure all of you care about. Here's a problem on how we want to build an AI system to learn to answer the following question: "Will I pass this class?" I'm sure all of you are really worried about this question.

To do this, let's start with a simple model using two input features:

1. How many lectures you attend.
2. How many hours you spend on your final project.

Let's look at some past years of this class. We can plot where previous students fall in this space of lectures attended versus hours spent on the final project. Each point represents a student, and the color of the point indicates whether they passed or failed the class. This lets you visualize the feature space we talked about before.

Now, let's consider you. You fall right here, at the point (4, 5) in this feature space. You've attended four lectures and will spend five hours on the final project. You want to build a neural network to determine, given everyone else in the class from all the previous years, what is your likelihood of passing or failing this class.

We now have all the building blocks to solve this problem using a neural network. Let's do it. We have two inputs: the number of lectures you attend and the number of hours you spend on your final project, which are 4 and 5, respectively. We pass these two inputs in as our variables X1 and X2, and feed them into a neural network with a single hidden layer of three units.
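
A minimal sketch of that network (untrained, so its output is arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(3, activation="relu"),     # the 3 hidden units
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of passing the class
])

x = tf.constant([[4.0, 5.0]])  # 4 lectures attended, 5 hours on the project
print(model(x))                # random, untrained weights: the prediction means nothing yet
```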

We can see that the final predicted output probability for you to pass this class is 0.1 or 10%. This is a very bleak outcome. The actual probability is 1. Attending four out of the five lectures and spending five hours on your final project, you actually lived in a part of the feature space which was very positive. It looked like you were going to pass the class.

So, what happened here? Does anyone have any ideas why the neural network got this so terribly wrong? It's not trained. Exactly, this neural network is not trained. We haven't shown it any of the green and red data. You should really think of neural networks like babies. Before they see data, they haven't learned anything. There's no expectation that they should be able to solve any of these types of problems before we teach them something about the world.

So, let's teach this neural network something about the problem first and train it.

Loss functions

00:41:12 - 00:44:22 Link to video

We first need to tell our neural network when it's making bad decisions. We need to teach it, really train it, somewhat like how we as humans learn. We have to inform the neural network when it gets an answer incorrect so that it can learn how to get the answer correct. The closer its answer is to the ground truth, the better.

For example, the actual value for you passing this class was a probability of 1 (100%), but it predicted a probability of 0.1. We compute what's called a loss. The closer these two things are together, the smaller your loss should be, and the more accurate your model should be.

Let's assume that we have data not just from one student, but from many students. Many students have taken this class before, and we can plug all of them into the neural network and show them all to this system. Now, we care not only about how the neural network did on just this one prediction, but we care about how it predicted on all of these different people that the neural network has seen in the past as well during this training and learning process.

When training the neural network, we want to find a network that minimizes the empirical loss between our predictions and those ground truth outputs. We're going to do this on average across all of the different inputs that the model has seen.

If we look at this problem of binary classification, between yeses and nos, like "Will I pass the class or will I not?", the output is a 0-or-1 probability. We can use what is called the softmax cross-entropy loss to determine whether this network is getting the answer correct or incorrect. Cross-entropy is a loss function that tells our neural network how far apart two probability distributions are. The output is a probability distribution, and we're measuring how bad the network's predicted distribution is so that we can give it feedback and get a better answer.
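
For the single pass/fail output here, the binary form of cross-entropy applies; a minimal sketch:

```python
import tensorflow as tf

y_true = tf.constant([1.0])  # you actually passed: probability 1
y_pred = tf.constant([0.1])  # the network predicted only 0.1

# Cross-entropy grows as the prediction moves away from the ground truth.
loss = tf.keras.losses.binary_crossentropy(y_true, y_pred)
```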

Now, let's suppose that instead of predicting a binary output, we want to predict a real-valued output: a number that can take any value, from minus to plus infinity. For example, if you wanted to predict the grade you get in a class, it doesn't necessarily need to be between 0 and 1, or even between 0 and 100. Because our outputs are no longer a probability distribution, we use a different loss function to handle that value.

For example, you might compute a mean squared error loss function between your true value or your true grade of the class and the predicted grade. These are two numbers, not probabilities necessarily. You compute their difference, square it to look at the distance between the two (an absolute distance where the sign doesn't matter), and then you can minimize this.
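
As a sketch with made-up grades:

```python
import tensorflow as tf

y_true = tf.constant([90.0])  # the actual grade (made up)
y_pred = tf.constant([75.0])  # the predicted grade (made up)

# Squaring the difference penalizes the distance regardless of sign.
loss = tf.reduce_mean(tf.square(y_true - y_pred))
```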

Okay, great. So let's put all of this together.

Training and gradient descent

00:44:22 - 00:49:52 Link to video

We want to find the network weights that achieve the lowest loss.

This loss information, combined with our network, forms a unified problem and a unified way to train our neural network. We want to find a neural network that solves this problem on all of this data on average; that's how we contextualized the problem earlier in the lecture. Effectively, we're trying to find the weights of our neural network: that big vector W we talked about earlier. The task is to compute the vector W based on all of the data we have seen.
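
Written out, that objective is the following minimization, where f(x; W) denotes the network's prediction and the notation here is ours:

```latex
W^{*} = \arg\min_{W} \; \frac{1}{n} \sum_{i=1}^{n}
    \mathcal{L}\!\left( f\!\left(x^{(i)}; W\right),\, y^{(i)} \right)
```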

Now, the vector W also determines what the loss is. Given a single vector W, we can compute how badly this neural network performs on our data: the loss, the deviation of the network's predictions from the ground truth. Remember that W is just a group of numbers, a very big list of weights for every single layer and every single neuron in our neural network. We want to find that vector based on a lot of data. That's the problem of training a neural network.

Remember, our loss function is just a simple function of our weights. If we have only two weights in our neural network, like we saw earlier in the slide, then we can plot the loss landscape over this two-dimensional space. We have two weights, W1 and W2, and for every single configuration or setting of those two weights, our loss will have a particular value, which here we're showing as the height of this graph. For any W1 and W2, what is the loss? What we want to do is find the lowest point. What is the best loss? What are the weights such that our loss will be as good as possible? The smaller the loss, the better. We want to find the lowest point in this graph.

Now, how do we do that? We start somewhere in this space. We don't know where to start, so let's pick a random place. From that place, we compute what's called the gradient of the landscape at that particular point: a very local estimate of the direction in which the slope is increasing at my current location. That tells us not only where the slope is increasing but, more importantly, where it is decreasing: if I negate that direction and go the opposite way, I can step down the landscape and change my weights so as to lower my loss. So let's take a small step, just a small step, in the direction opposite to the uphill direction. We'll keep repeating this process: compute a new gradient at the new point, then take another small step, over and over again until we converge at what's called a local minimum. Based on where we started, it may not be the global minimum of the whole landscape, but by following this very simple algorithm we are guaranteed to converge to a local minimum.

Let's summarize this algorithm. This algorithm is called gradient descent. Let's summarize it first in pseudocode and then we'll look at it in actual code in a second. There are a few steps. The first step is to initialize our location somewhere randomly in this weight space. We compute the gradient of our loss with respect to our weights. Then, we take a small step in the opposite direction and keep repeating this in a loop over and over again until we stop moving. Our network basically finds where it's supposed to end up.

We'll come back to this small step: we're multiplying our gradient by what I keep calling a small step, and we'll talk about that more in a later part of this lecture. For now, let's very quickly look at the analogous code as well, which mirrors the pseudocode very nicely. We randomly initialize our weights; this happens every time you train a neural network. Then we have a loop (here we're showing it without even a convergence check, just looping forever): we compute the loss at the current location, compute the gradient (which way is up), negate that gradient, multiply it by some learning rate (LR), our small step, and then take a small step in that direction.
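
A rough sketch of that loop using TensorFlow's automatic differentiation; the quadratic compute_loss here is a dummy stand-in for a real model's loss:

```python
import tensorflow as tf

def compute_loss(w):
    # Dummy loss: a simple bowl with its minimum at w = 3.
    return tf.reduce_sum((w - 3.0) ** 2)

weights = tf.Variable(tf.random.normal(shape=(2, 1)))  # random initialization
lr = 0.01  # the learning rate: the "small step"

for _ in range(1000):  # the slide loops forever; here we just take many steps
    with tf.GradientTape() as tape:
        loss = compute_loss(weights)          # loss at the current location
    gradient = tape.gradient(loss, weights)   # which way is "up" in the landscape
    weights.assign_sub(lr * gradient)         # take a small step the opposite way
```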

Let's take a deeper look at this term here. This is called the gradient. This tells us which way is up in that landscape. It tells us even more than that. It tells us how our landscape, how our loss, is changing as a function of all of our weights. But I have not told you how to compute this, so let's talk about that process. That process is called backpropagation. We'll go through this very briefly and we'll start with the...

Backpropagation

00:49:52 - 00:54:57 Link to video

The simplest neural network that's possible, right? So we already saw the simplest building block, which is a single neuron. Now let's build the simplest neural network, which is just a one-neuron neural network. It has one hidden neuron; it goes from input to hidden neuron to output. We want to compute the gradient of our loss with respect to this weight, W2.

Okay, so I'm highlighting it here. We have two weights. Let's compute the gradient first with respect to W2. This tells us how much a small change in W2 affects our loss. Does our loss go up or down if we move our W2 a little bit in one direction or another?

So let's write out this derivative. We can start by applying the chain rule backwards from the loss through the output. Specifically, we can decompose the gradient dJ/dW2 into two parts: dJ/dY, the derivative of the loss with respect to the output Y, multiplied by dY/dW2. This decomposition is possible because the loss depends on W2 only through the output Y.

Now, let's suppose we don't want to do this for W2 but for W1. We can use the exact same process, but now it's one step further. We'll replace W2 with W1. We need to apply the chain rule yet again to decompose the problem further. Now we propagate our old gradient that we computed for W2 all the way back one more step to the weight that we're interested in, which in this case is W1.

We keep repeating this process over and over again, propagating these gradients backwards from output to input to compute what we ultimately want: the derivative of our loss with respect to every weight in our neural network. This tells us how much a small change in every single weight in our network affects the loss. Does our loss go up or down if we change this weight a little bit in this direction or a little bit in that direction?
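
A small sketch that checks this decomposition numerically on the one-hidden-neuron network (all values here are invented):

```python
import tensorflow as tf

x, y_true = tf.constant(2.0), tf.constant(1.0)
w1, w2 = tf.Variable(0.5), tf.Variable(-0.3)

with tf.GradientTape() as tape:
    z1 = tf.math.sigmoid(w1 * x)      # hidden neuron
    y_hat = tf.math.sigmoid(w2 * z1)  # output neuron
    loss = (y_true - y_hat) ** 2      # J: a simple squared-error loss

# Autodiff applies the chain rule backwards through the network for us.
dJ_dw1, dJ_dw2 = tape.gradient(loss, [w1, w2])

# The same quantity by hand: dJ/dW2 = dJ/dY * dY/dW2,
# using sigmoid'(a) = sigmoid(a) * (1 - sigmoid(a)).
dJ_dy = -2.0 * (y_true - y_hat)
dY_dw2 = y_hat * (1.0 - y_hat) * z1
print(dJ_dw2.numpy(), (dJ_dy * dY_dw2).numpy())  # the two should match
```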

Yes, I think you used the terms "neuron" and "perceptron"; is there a functional difference? No, neuron and perceptron are the same. People typically say "neural network," which is why calling a single unit a "neuron" has also gained popularity, but perceptron is the original, formal term. The two terms are identical.

Okay, so now we've covered a lot. We've covered the forward propagation of information through a neuron and through a neural network all the way through. We've also covered the backpropagation of information to understand how we should change every single one of those weights in our neural network to improve our loss. That was the backpropagation algorithm in theory. It's actually pretty simple; it's just the chain rule. There's nothing more than just the chain rule. The nice part is that deep learning libraries actually do this for you. They compute backpropagation for you, so you don't actually have to implement it yourself, which is very convenient.

But now it's important to touch on the practical aspects. Even though the theory is not that complicated for backpropagation, let's touch on it from a practical perspective. When you want to implement these neural networks, what are some insights?

Optimization of neural networks in practice is a completely different story. It's not straightforward at all. In practice, it's very difficult and usually very computationally intensive to do this backpropagation algorithm. Here's an illustration from a paper that came out a few years ago that attempted to visualize a very deep neural network's loss landscape. Previously, we had another depiction of how a neural network would look in a two-dimensional landscape. Real neural networks are not two-dimensional; they are hundreds, millions, or billions of dimensions. What would those loss landscapes look like? You can actually try some clever techniques to visualize them. This is one paper that attempted to do that, and it turns out that they look extremely messy.

The important thing is that if you run this algorithm and start in a bad place, depending on your neural network, you may not end up at the global solution. Your initialization matters a lot. You need to traverse these local minima and try to find the global minimum. More than that, you want to construct neural networks whose loss landscapes are much more amenable to optimization than this one; this is a very bad loss landscape. There are some techniques that we can apply to our neural networks to smooth out their loss landscapes and make them easier to optimize.

Recall that update equation that we talked about earlier with gradient descent. There is this parameter here that we didn't talk about. We described this as the little step that you could take. It's a small number that multiplies with the...

Setting the learning rate

00:54:57 - 00:58:54 Link to video

The direction, which is your gradient. Rather than going all the way in that direction, you just take a small step. In practice, setting this value right, even though it's just one number, can be rather difficult. If we set the learning rate too small, the model can get stuck in local minima: here, it starts and gets stuck in this local minimum, and even when it doesn't get stuck, it converges very slowly. If the learning rate is too large, it can overshoot; in practice, it may even diverge and explode, never finding any minimum.

Ideally, we want learning rates that are neither too small nor too large: large enough to escape those local minima, but small enough that they won't diverge and will still find their way to the global minimum. So something like this is what you should intuitively have in mind: a rate that can overshoot the local minima but settle into a better minimum and finally stabilize there.

So, how do we actually set these learning rates in practice? What does that process look like? Idea number one is very basic: try a bunch of different learning rates and see what works. That's actually not a bad process in practice; it's one of the processes that people use. But let's see if we can do something smarter than this. Let's see how we can design algorithms that can adapt to the landscapes.

In practice, there's no reason why this should be a single number. Can we have learning rates that adapt to the model, the data, the landscapes, and the gradients that it's seeing around? This means that the learning rate may actually increase or decrease as a function of the gradients in the loss function, how fast we're learning, or many other options. There are many different ideas that could be done here. In fact, there are many widely used different procedures or methodologies for setting the learning rate. During your labs, we actually encourage you to try out some of these different ideas for different types of learning rates and even play around with what the effect of increasing or decreasing your learning rate is. You'll see very striking differences.

Why not just search for the absolute minimum exhaustively? Well, a few things. Number one, it's not a closed space: every weight can range anywhere from minus to plus infinity. Even a one-dimensional neural network with just one weight has an unbounded search space. In practice, it's even worse than that, because you have billions of dimensions, each of them unbounded. Testing every possible weight configuration that this neural network could take is simply not practical, even for a very small neural network.

In your labs, you can really put all of this information into practice. First, you define your model. Second, you define your optimizer, which we previously described as the gradient descent optimizer; here we're calling it stochastic gradient descent, or SGD, and we'll talk about that more in a second. Also note that your optimizer could be any of the adaptive optimizers. You can swap them out, and you should: test different ones here to see their impact.
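
A sketch of the swap itself; all of these exist in tf.keras.optimizers, and the learning rates shown are just common defaults:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)      # plain stochastic gradient descent
# optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)  # adaptive alternatives to try
# optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.01)
# optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001)
```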

Batched gradient descent

00:58:54 - 01:02:28 Link to video

Trying these methods out in your training procedure will help you gain valuable intuition for the different insights that come with them. I want to continue briefly, just for the end of this lecture, with tips for training neural networks in practice, focusing on the powerful idea of batching data: not seeing all of your data at once, but instead processing it in batches.

To do this, let's very briefly revisit the gradient descent algorithm. The gradient computation, or the backpropagation algorithm, is a very computationally expensive operation. It's even more challenging because we previously described it in a way where we would have to compute it over a summation of every single data point in our entire dataset. That's how we defined it with the loss function—it's an average over all of our data points, which means summing over all of our data points' gradients.

In most real-life problems, this would be completely infeasible because our datasets are simply too big, and the models are too large to compute those gradients on every single iteration. Remember, this isn't just a one-time thing; it's every single step that you do. You keep taking small steps, so you need to repeat this process continuously.

Instead, let's define a new gradient descent algorithm called Stochastic Gradient Descent (SGD). Instead of computing the gradient over the entire dataset, let's just pick a single training point and compute that one training point's gradient. The nice thing about that is it's much easier to compute that gradient since it only needs one point. The downside is that it's very noisy and stochastic because it was computed using just one example. So, you have that tradeoff.

What's the middle ground? The middle ground is to take not one data point and not the full dataset, but a batch of data: a mini-batch. In practice, this could be something like 32 data points, which is a common batch size. This gives us an estimate of the true gradient by averaging the gradients of those 32 samples. It's still fast to compute, because 32 is much smaller than the size of your entire dataset, and while it's still somewhat noisy, it's usually accurate enough in practice, and you can iterate much faster.

Since the batch size is normally not that large (think of something in the tens or hundreds of samples), it's very fast to compute this in practice compared to regular gradient descent. It's also much more accurate compared to stochastic gradient descent. The increase in accuracy of this gradient estimation allows us to converge to our solution significantly faster. It's not only about speed; the increase in accuracy of those gradients allows us to get to our solution much faster, which ultimately means we can train much faster and save compute.
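
A self-contained sketch of mini-batching with tf.data (the data, model, and loss here are dummies):

```python
import tensorflow as tf

# Dummy data and model so the sketch runs end to end; shapes are arbitrary.
x_train = tf.random.normal((1024, 2))
y_train = tf.random.normal((1024, 1))
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.MeanSquaredError()

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shuffle(buffer_size=1024).batch(32)  # mini-batches of 32

for x_batch, y_batch in dataset:
    with tf.GradientTape() as tape:
        loss = loss_fn(y_batch, model(x_batch))        # loss over just this batch
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
```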

Another really nice thing about mini-batches is that they allow for parallelizing our computation. This was a concept we talked about earlier in the class, and here's where it comes in. We can split up those batches. For example, if our batch size is 32, we can split them up onto different workers. Different parts of the GPU can tackle different parts of our data points. This allows us to achieve even more significant speed-ups using GPU architectures and hardware.

Regularization: dropout and early stopping

01:02:28 - 01:08:47 Link to video

Finally, the last topic I want to talk about before we end this lecture and move on to lecture number two is overfitting. Overfitting is a problem that exists in all of machine learning, not just deep learning. The key problem is determining whether your model is accurately capturing your true data set or just learning subtle details that are only specific to your training data.

To put it differently, we want to build models that learn representations from our training data that still generalize to brand new, unseen test points. The real goal is to teach our model something from a lot of training data, but have it perform well when deployed in the real world, encountering data it never saw during training. If your model does very well on your training data but badly on testing data, it has overfit to the training data it saw.

On the other hand, there's also underfitting. On the left-hand side of the image, you can see a model that does not fit the data enough. It will achieve similar performance on the training and testing distributions, but both underperform the actual capabilities of your system. Ideally, you want to end up somewhere in the middle: not so complex that you're memorizing all the nuances of your training data, but still performing well on new data.

To address this problem in neural networks and machine learning in general, there are a few different techniques you should be aware of. These techniques are called regularization. Regularization is a technique that constrains your optimization problem to discourage complex models. It improves the generalization of your model on unseen data.

The most popular regularization technique is called Dropout. Let's revisit this picture of a deep neural network. During training, Dropout randomly sets some of the activations (outputs of every single neuron) to zero with some probability, typically 50%. This means that 50% of the neurons are shut down or "dropped out" during a forward pass, and only the remaining 50% forward pass information. This technique lowers the capacity of our neural network dynamically because, in the next iteration, a different 50% of neurons are dropped out. This forces the network to learn different pathways from input to output, preventing it from relying too heavily on any small part of the features present in the training data.
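
In code, dropout is just another layer placed between the dense layers; a sketch with arbitrary sizes:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),  # during training, zero each activation with p = 0.5
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```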

The second regularization technique is called early stopping. This technique is model-agnostic and can be applied to any type of model as long as you have a testing set. The idea is to stop training before the model starts to overfit. Overfitting is when the model's performance on the test set starts to worsen while the training performance continues to improve. By monitoring the loss over the course of training, you can identify the point where the test loss plateaus and starts to increase. This is the point where you should stop training to prevent overfitting.

In the beginning, both the training set and the test set losses decrease, indicating that the model is getting stronger. Eventually, the test loss plateaus and starts to increase, while the training loss continues to decrease. This is the point where you need to stop training because the model is starting to overfit. On the left-hand side, you can see the opposite problem where the model has not fully utilized its capacity, and the testing accuracy can still improve.

This idea is powerful and easy to implement in practice. All you need to do is monitor the loss over the course of training and keep the model from just before the testing loss starts to get worse.
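
Keras can do this monitoring for you; a sketch where the model and data are dummies standing in for your own:

```python
import tensorflow as tf

# Dummy model and data so the sketch runs; in the labs these come from your task.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
x_train = tf.random.normal((256, 2))
y_train = tf.random.normal((256, 1))

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the held-out loss
    patience=5,                 # tolerate 5 epochs without improvement
    restore_best_weights=True,  # keep the weights from just before things got worse
)

model.fit(x_train, y_train, validation_split=0.2,
          epochs=100, callbacks=[early_stop])
```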

I'll conclude this lecture by emphasizing the importance of these regularization techniques in building robust models that generalize well to unseen data.

Summary

01:08:47 - 01:09:57 Link to video

Just summarizing three key points that we've covered in the class so far. This is a very packed class, so the entire week is going to be like this, and today is just the start.

So far, we've learned the fundamental building blocks of neural networks, starting all the way from just one neuron, also called a perceptron. We learned that we can stack these systems on top of each other to create a hierarchical network and how we can mathematically optimize those types of systems.

Finally, in the very last part of the class, we talked about tips and techniques for actually training these systems and putting them into practice.

In the next lecture, we're going to hear from Ava on deep sequence modeling using RNNs and also a really new and exciting algorithm and type of model called the Transformer, which is built off of this principle of attention. You're going to learn about it in the next class.

For now, let's take a brief pause and resume in about five minutes so we can switch speakers and Ava can start her presentation.

Okay, thank you.