In the previous video, we talked about a geometrical approach to understanding what supervised learning is doing. In this approach, we can think of the training data set as a set of points in a space, the classes as - in this visualization - the colors of the different points, and the supervised learning algorithm as finding a surface that separates points of different classes from each other. Now, once we have our separating surface, we can feed in new images, and our trained model can make predictions about the class of each new image - whether it's a cat or a dog. Hopefully these predictions are going to be accurate, but, of course, the model can also make mistakes.

How can we think about the quality of the trained model? Here, the important concept is "generalization." To understand generalization, we should think about two possible extremes in which our model can go wrong. One extreme, called "underfitting," arises when the surface the model finds doesn't capture the shape that actually divides the two classes. Often this happens because the model has too few parameters - too few knobs to adjust - so it's limited in the kinds of surfaces it can find. The other extreme is called "overfitting." Overfitting often arises because the model has too many parameters - too many knobs - and what happens is that it simply memorizes the training data set instead of learning the underlying concept in it.

We can think about this with three models: one that's underfit; the one we looked at before, which is a fairly smooth curve; and what we call here the "overfit" model, which in this case is a very wiggly one. Take the previous example of an image the models were not trained on - a new image they have never seen - and ask each model to classify it as either cat or dog. The underfit model, since it hasn't really captured the actual shape of the classes, may well make the wrong prediction. The intermediate model will hopefully give us the right prediction. And the overly complicated model that has overfit - because it has fit essentially every nook and cranny of the training data set - will again tend to be wrong more often than the best model.

Now, when we're provided with a training data set and we fit the separating surface to it, there can be two different types of errors. The first kind is called a "training error": one of the actual training data set points is classified incorrectly. Usually, what the supervised learning algorithm does is try to reduce the number of such errors - with some caveats. A different kind of error occurs when we feed the model an input it has never seen before - an image it was not trained on - and it misclassifies it. That's called a "testing error": we're testing the model, now that it has learned, and it either makes a mistake or it doesn't. What's called in the literature "generalization performance" is the ability of the learning algorithm to not make testing errors - that is, to do well on data it has never seen before. And generalization performance is perhaps the central concept - one might say - in machine learning.
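To make training versus testing error concrete, here is a minimal sketch (not from the video) using scikit-learn: a noisy two-class toy data set stands in for the cat/dog points, and decision trees of increasing depth play the roles of the underfit, intermediate, and overfit models. The data set and model choices are illustrative assumptions.

```python
# Minimal sketch: training error vs. testing (generalization) error.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy "points in a space" with two classes, plus some label noise.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Three models of increasing flexibility: underfit, intermediate, overfit.
for depth in (1, 4, None):          # None lets the tree grow until it memorizes
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_err = 1 - model.score(X_train, y_train)   # errors on points it was fit to
    test_err = 1 - model.score(X_test, y_test)      # errors on points it has never seen
    print(f"max_depth={depth}: train error {train_err:.2f}, test error {test_err:.2f}")
```

Typically the very deep tree drives its training error close to zero while its testing error stays above that of the intermediate model - the signature of overfitting.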
It captures what we really care about. We don't really care about the machine learning algorithm capturing every nook and cranny of the training data set - memorizing the training data set. What we care about is its ability to find the real underlying patterns that generalize well to things it has never seen before.

Now, think about taking a model and making it more complicated, for example by increasing the number of parameters - the number of knobs it can tune. Usually, the training error - the number of mistakes on the training data - will keep going down as we add more and more parameters: we can memorize the training data set better and better and fit every nook and cranny. On the other hand, the generalization performance, which is what we actually care about, will tend to first improve as we make the model more complicated and then, at a certain point, start declining again. That is the point at which the model starts overfitting - just memorizing the data set instead of learning to generalize. So you'll often see curves like those shown on your lower left: training error decreases monotonically as we add more parameters, but testing error has a minimum, and the number of parameters at that minimum is the best for the situation.

Another term you will often hear in the literature and should know about is the bias-variance decomposition. This is a way to decompose the error the model makes on new data. The bias part comes from the model simply being wrong - it's not flexible enough to capture the correct separating surface. The variance part comes from the model capturing the randomness in our training data set. What that means is, for example, if we're learning to classify images as "cat" or "dog," there's going to be some randomness in the images we're providing as the training data set. We don't want to memorize that randomness, because if we do, the model will vary a great deal from training data set to training data set, and that is the variance part of the error.

Now, you might ask: how should one select the number of parameters, or the complexity of the model, to avoid overfitting - that is, to have the model be flexible enough to capture the real patterns in the data but not so flexible that it just memorizes the training data set instead of learning the underlying concept? One technique you will hear about and see used is called "cross-validation," or simply "validation." In validation, we take our training data set and split it into two chunks. One chunk becomes the new training data set - a smaller part of the original data we were provided. The other chunk is what is called the "validation data set." We then fit our model on the new training data set, which is a subset of what we were originally provided, and we test how well it does on the validation data set. This is a way to estimate how well it will do on completely new data it hasn't seen before, because we're testing it on data that was not used to tune its knobs.
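As a concrete illustration of the split just described, here is a minimal sketch (again an illustrative assumption, not from the video) of hold-out validation: we carve a validation chunk out of the data we were given, fit trees of increasing depth on the remaining chunk, and keep the depth whose validation error is smallest.

```python
# Minimal sketch: hold-out validation for choosing model complexity.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
# Split the data we were given into a new (smaller) training chunk and a validation chunk.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

best_depth, best_val_err = None, float("inf")
for depth in range(1, 12):                      # sweep over the "number of knobs"
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_fit, y_fit)                     # tune the knobs on the training chunk only
    val_err = 1 - model.score(X_val, y_val)     # estimate generalization on held-out data
    if val_err < best_val_err:
        best_depth, best_val_err = depth, val_err
print(f"selected max_depth={best_depth} with validation error {best_val_err:.2f}")
```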
Validation is not a foolproof method and it has some pitfalls, but it's a very powerful one that we can use to help us select a model with the right number of parameters, or to choose between different alternative models. And, in practice, this is usually how the choice between alternative models is actually made.
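For completeness, here is a minimal sketch of k-fold cross-validation - one common variant of this idea, in which each chunk of the data takes a turn as the validation set - used to choose between two alternative models. The particular models and data set are illustrative assumptions, not from the video.

```python
# Minimal sketch: 5-fold cross-validation to compare two candidate models.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(),
    "decision tree (depth 4)": DecisionTreeClassifier(max_depth=4, random_state=0),
}
for name, model in candidates.items():
    # Each of the 5 folds is held out once; the model is fit on the other 4.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean validation accuracy {scores.mean():.2f}")
```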