In the previous video, we talked about a geometrical approach to understanding what supervised learning is doing. In this approach, we can think of the training data set as a set of points in a space, the classes as - in this visualization - the colors of the different points, and the supervised learning algorithm as finding a surface that separates points of different classes from each other. Now, once we have our separating surface, we can feed in new images, and our trained model can make predictions about the class of each new image - whether it's a cat or a dog. Hopefully these predictions are going to be accurate, but, of course, the model can also make mistakes.

How can we think about the quality of the trained model? Here, the important concept is "generalization." To understand generalization, we should think about two possible extremes in which our model can go wrong. One extreme, called "underfitting," arises when the surface the model finds doesn't capture the shape that actually divides the two classes. Often this happens because the model has too few parameters - too few knobs to adjust - so it's limited in the kinds of surfaces it can find. The other extreme is called "overfitting." Overfitting often arises because the model has too many parameters - too many knobs - and what happens is that it simply memorizes the training data set instead of learning the underlying concept in it.

We can think about this with three models: one that's underfit; the one we looked at before, which is a fairly smooth curve; and what we call here the "overfit" model, which in this case is a very wiggly one. Take the previous example of an image the models were not trained on - a new image they have never seen - and ask each model to classify it as either cat or dog. The underfit model, since it hasn't really captured the actual shape of the classes, may well make the wrong prediction. The intermediate model will hopefully give us the right prediction. And the overly complicated model that has overfit - because it has fit essentially every nook and cranny of the training data set - will again tend to be wrong more often than the best model.

Now, when we're provided with a training data set and we fit the separating surface to it, there can be two different types of errors. The first kind is called a "training error": one of the actual training data set points is classified incorrectly. Usually, what the supervised learning algorithm does is try to reduce the number of such errors - with some caveats. A different kind of error occurs when we feed the model an input it has never seen before - an image it was not trained on - and it misclassifies it. That's called a "testing error": we're testing the model, now that it has learned, and it either makes a mistake or it doesn't. What's called in the literature "generalization performance" is the ability of the learning algorithm to not make testing errors - that is, to do well on data it has never seen before. And generalization performance is perhaps the central concept - one might say - in machine learning.
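To make training versus testing error concrete, here is a minimal sketch (not from the video) using scikit-learn: a noisy two-class toy data set stands in for the cat/dog points, and decision trees of increasing depth play the roles of the underfit, intermediate, and overfit models. The data set and model choices are illustrative assumptions.

```python
# Minimal sketch: training error vs. testing (generalization) error.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy "points in a space" with two classes, plus some label noise.
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Three models of increasing flexibility: underfit, intermediate, overfit.
for depth in (1, 4, None):          # None lets the tree grow until it memorizes
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_err = 1 - model.score(X_train, y_train)   # errors on points it was fit to
    test_err = 1 - model.score(X_test, y_test)      # errors on points it has never seen
    print(f"max_depth={depth}: train error {train_err:.2f}, test error {test_err:.2f}")
```

Typically the very deep tree drives its training error close to zero while its testing error stays above that of the intermediate model - the signature of overfitting.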
It captures what we really care about. We don't really care about the machine learning algorithm capturing every nook and cranny of the training data set - memorizing the training data set. What we care about is its ability to find the real underlying patterns that generalize well to things it has never seen before.

Now, think about taking a model and making it more complicated, for example by increasing the number of parameters - the number of knobs it can tune. Usually, the training error - the number of mistakes on the training data - will keep going down as we add more and more parameters: we can memorize the training data set better and better and fit every nook and cranny. On the other hand, the generalization performance, which is what we actually care about, will tend to first improve as we make the model more complicated and then, at a certain point, start declining again. That is the point at which the model starts overfitting - just memorizing the data set instead of learning to generalize. So you'll often see curves like those shown on your lower left: training error decreases monotonically as we add more parameters, but testing error has a minimum, and the number of parameters at that minimum is the best for the situation.

Another term you will often hear in the literature and should know about is the bias-variance decomposition. This is a way to decompose the error the model makes on new data. The bias part comes from the model simply being wrong - it's not flexible enough to capture the correct separating surface. The variance part comes from the model capturing the randomness in our training data set. What that means is, for example, if we're learning to classify images as "cat" or "dog," there's going to be some randomness in the images we're providing as the training data set. We don't want to memorize that randomness, because if we do, the model will vary a great deal from training data set to training data set, and that is the variance part of the error.

Now, you might ask: how should one select the number of parameters, or the complexity of the model, to avoid overfitting - that is, to have the model be flexible enough to capture the real patterns in the data but not so flexible that it just memorizes the training data set instead of learning the underlying concept? One technique you will hear about and see used is called "cross-validation," or simply "validation." In validation, we take our training data set and split it into two chunks. One chunk becomes the new training data set - a smaller part of the original data we were provided. The other chunk is what is called the "validation data set." We then fit our model on the new training data set, which is a subset of what we were originally provided, and we test how well it does on the validation data set. This is a way to estimate how well it will do on completely new data it hasn't seen before, because we're testing it on data that was not used to tune its knobs.
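As a concrete illustration of the split just described, here is a minimal sketch (again an illustrative assumption, not from the video) of hold-out validation: we carve a validation chunk out of the data we were given, fit trees of increasing depth on the remaining chunk, and keep the depth whose validation error is smallest.

```python
# Minimal sketch: hold-out validation for choosing model complexity.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
# Split the data we were given into a new (smaller) training chunk and a validation chunk.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

best_depth, best_val_err = None, float("inf")
for depth in range(1, 12):                      # sweep over the "number of knobs"
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_fit, y_fit)                     # tune the knobs on the training chunk only
    val_err = 1 - model.score(X_val, y_val)     # estimate generalization on held-out data
    if val_err < best_val_err:
        best_depth, best_val_err = depth, val_err
print(f"selected max_depth={best_depth} with validation error {best_val_err:.2f}")
```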
Validation is not a foolproof method and it has some pitfalls, but it's a very powerful one that we can use to help us select a model with the right number of parameters, or to choose between different alternative models. And, in practice, this is usually how the choice between alternative models is actually made.
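For completeness, here is a minimal sketch of k-fold cross-validation - one common variant of this idea, in which each chunk of the data takes a turn as the validation set - used to choose between two alternative models. The particular models and data set are illustrative assumptions, not from the video.

```python
# Minimal sketch: 5-fold cross-validation to compare two candidate models.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(),
    "decision tree (depth 4)": DecisionTreeClassifier(max_depth=4, random_state=0),
}
for name, model in candidates.items():
    # Each of the 5 folds is held out once; the model is fit on the other 4.
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean validation accuracy {scores.mean():.2f}")
```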