
coursera-university-of-washington / machine_learning / 2_regression / lecture / week1 / Linear Regression.ipynb

tuanavu, commit ae07d19 on Dec 6, 2015

https://github.com/tuanavu/coursera-university-of-washington/blob/master/machine_learning/2_regression/lecture/week1/Linear Regression.ipynb



1) Regression fundamentals

1) Data and model

x: input
y: output
f(x): functional relationship, expected relationship between x and y
$\epsilon$: error term

You can easily imagine that there are errors in this model, because you can have two
houses that have exactly the same number of square feet, but sell for very different prices
because they could have sold at different times. They could have had different numbers of
bedrooms, or bathrooms, or size of the yard, or specific location, neighborhoods, school
districts. Lots of things that we might not have taken into account in our model.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/M6cMZ/regression-fundamentals-data-model) 6:45

2) Block diagram

$\hat{y}$: predicted house sales price
$\hat{f}$: estimated function
y: actual sales price

We're gonna compare the actual sales price to the predicted sales price using any given
$\hat{f}$. And the quality metric will tell us how well we do. So there's gonna be some error in our
predicted values. And the machine learning algorithm we're gonna use to fit these
regression models is gonna try to minimize that error. So it's gonna search over all these
functions to reduce the error in these predicted values.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/fKsPh/regression-ml-block-diagram) 3:48

2) The simple linear regression model, its use, and interpretation

1) The simple linear regression model

What's the equation of a line?

It's just (intercept + slope * our variable of interest), so we're gonna say that's

$f(x) = w_0 + w_1 x$

And what this regression model then specifies is that each one of our observations $y_i$ is
simply that function evaluated at $x_i$, plus an error term, so:

$y_i = w_0 + w_1 x_i + \epsilon_i$

$\epsilon_i$: error term, the distance from our specific observation back down to the line
$w_0, w_1$: intercept and slope respectively; they are called regression coefficients.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/N8p7w/the-simple-linear-regression-model) 1:12

2) The cost of using a given line

What is the difference between residual and error?

https://www.coursera.org/learn/ml-regression/lecture/WYPGc/the-cost-of-using-a-given-line/discussions/Lx0xn5j1EeW0dw6k4EUmPw

Residual is the difference between the observed value and the predicted value.
Error is the difference between the observed value and the (often unknown) true
value. As such, residuals refer to samples whereas errors refer to populations.
There is a true function f(x) that we want to learn, and the observed values we
have are in fact $y = f(x) + \epsilon$, because we need to assume our measures have
some error (or noise if you prefer this term). This is the real error, because the
real value is $f(x)$. On the other hand, the residual is $y - \hat{f}(x)$, where $\hat{f}$ is our
approximation (estimation) of the real f(x).

Residual sum of squares (RSS)

Take the difference between each actual value and its predicted value, square each
difference, and sum them up:

$RSS(w_0, w_1) = \sum_{i=1}^{N} (y_i - (w_0 + w_1 x_i))^2$
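As a sketch in plain Python (the data points and the candidate line here are made up for illustration; they are not from the notebook):

```python
def rss(w0, w1, xs, ys):
    """Residual sum of squares of the line y = w0 + w1 * x over the data."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.5, 2.0]

# Cost of the candidate line y = 0 + 1 * x:
print(rss(0.0, 1.0, xs, ys))  # 0^2 + 0.5^2 + (-1)^2 = 1.25
```

A smaller RSS means the line fits these points better, which is exactly the quality metric minimized below.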

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/WYPGc/the-cost-of-using-a-given-line) 3:26

3) Using the fitted line

A model is in terms of some parameters, and a fitted line is a specific example within
that model class.

Why the hat notation?

https://www.coursera.org/learn/ml-regression/lecture/RjYbf/using-the-fitted-line/discussions/QOsWrZkGEeWKNwpBrKr_Fw

In statistics, the hat operator is used to denote the predicted value of the
parameter.

e.g., Y-hat stands for the predicted value of Y (house price).

http://mathworld.wolfram.com/Hat.html

https://en.wikipedia.org/wiki/Hat_operator

The hat denotes a predicted value, as contrasted with an observed value. For our
purposes right now, I think of the hat value as the value that sits on the regression
line, because that's the value our regression analysis would predict. So, for
example, the residual for a particular observation is $y - \hat{y}$, where $y$ is the actual
observed outcome at a particular observed value of x and $\hat{y}$ is the value that our
regression analysis predicts for y at that same x value.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/RjYbf/using-the-fitted-line) 2:00

4) Interpreting the fitted line

$\hat{w}_1$: predicted change in the output per unit change in the input.

One thing I want to make very, very clear is that the magnitude of this slope, depends both
on the units of our input, and on the units of our output. So, in this case, the slope, the units
of slope, are dollars per square feet. And so, if I gave you a house that was measured in
some other unit, then this coefficient would no longer be appropriate for that.

For example, if the input is square feet and you have another house that was measured in
square meters instead of square feet, well, clearly I can't just plug that into the equation.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/x8ohF/interpreting-the-fitted-line) 3:54

3) An aside on optimization: one dimensional objectives

1) Defining our least squares optimization objective

Our objective here is to minimize $RSS(w_0, w_1)$ over all possible combinations of $w_0$ and $w_1$.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/a1QCT/defining-our-least-squares-optimization-objective) 2:35

2) Finding maxima or minima analytically

Concave function: The way we can define a concave function is we can look at any
two values of w: a and b. Then we draw a line between g(a) and g(b); that line lies
below g(w) everywhere.
Convex function: Opposite property of a concave function, where the line connecting
g(a) and g(b) is above g(w) everywhere.
There are functions which are neither concave nor convex, where the line
lies both below and above the g(w) function.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/RUtxG/finding-maxima-or-minima-analytically) 3:49

In a concave function, the point of max g(w) is the point where the
derivative = 0. The same holds for a convex function: if we want to find the minimum over all w
of g(w), at the minimum point the derivative = 0.

Example:

When we draw this function, we can see it has a concave form. How do I find its
maximum?

I take the derivative, set it equal to 0, and we can solve it for w = 10.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/RUtxG/finding-maxima-or-minima-analytically) 6:45

3) Finding the max via hill climbing

If we're looking at these concave situations and our interest is in finding the max over all w
of g(w), one thing we can look at is something called a hill-climbing algorithm. It's
going to be an iterative algorithm where we start somewhere in this space of possible w's
and then we keep changing w, hoping to get closer and closer to the optimum.

Okay, so, let's say that we start w here. And a question is well should I increase w, move
w to the right or should I decrease w and move w to the left to get closer to the
optimal.

What I can do is I can look at the function at w and I can take the derivative, and if
the derivative is positive like it is here, this is the case where I want to increase w. If
the derivative is negative, then I want to decrease w.
So, we can actually divide the space into two.
Where on the left of the optimal, we have that the derivative of g with
respect to w is greater than 0. And these are the cases where we're
gonna wanna increase w.
And on the right-hand side of the optimum we have that the derivative of g
with respect to w is negative. And these are cases where we're gonna
wanna decrease w.
If I'm exactly at the optimum, which maybe I'll call $w^*$, I do not want to
move to the right or the left, because I want to stay at the optimum. The
derivative at this point is 0.

So, again, the derivative is telling me what I wanna do. We can write down this hill-climbing
algorithm:

While not converged, I'm gonna take my previous w, where I was at iteration t (t
is the iteration counter), and I'm gonna move in the direction indicated by the
derivative. So, if the derivative of the function is positive, I'm going to be increasing
w, and if the derivative is negative, I'm going to be decreasing w, and that's exactly
what I want to be doing.
But instead of moving exactly the amount specified by the derivative at that point,
we can introduce something I'll call eta ($\eta$). And $\eta$ is what's called a step size.
So, when I go to compute my next w value, I'm gonna take my previous w value
and I'm going to move an amount based on the derivative as determined by the
step size:

$w^{(t+1)} = w^{(t)} + \eta \frac{dg}{dw}$

Example: Let's say I happen to start on this left-hand side at this w value here. And at this
point the derivative is pretty large. This function's pretty steep. So, I'm going to be taking a
big step. Then, I compute the derivative. I'm still taking a fairly big step. I keep stepping,
increasing. What I mean by each of these is I keep increasing w, keep taking a step in w,
computing the derivative, and as I get closer to the optimum, the size of the derivative
decreases.
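The hill-climbing loop above can be sketched in a few lines. The concave function is assumed here: g(w) = 5 - (w - 10)^2, chosen so its maximum sits at w = 10 like in the worked example; its derivative is dg/dw = -2(w - 10):

```python
def dg(w):
    # Derivative of the assumed concave g(w) = 5 - (w - 10)**2.
    return -2.0 * (w - 10.0)

def hill_climb(w, eta=0.1, tol=1e-6, max_iter=10000):
    """Repeatedly move w in the direction indicated by the derivative."""
    for _ in range(max_iter):
        d = dg(w)
        if abs(d) < tol:       # derivative ~ 0: we're at the maximum
            break
        w = w + eta * d        # dg/dw > 0 -> increase w; dg/dw < 0 -> decrease w
    return w

w_max = hill_climb(0.0)
print(round(w_max, 4))  # → 10.0
```

Note the same update works from either side of the optimum: starting at w = 20 the derivative is negative, so w decreases toward 10.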

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/O4j1e/finding-the-max-via-hill-climbing) 3:34

4) Finding the min via hill descent

We can use the same type of algorithm to find the minimum of a function.

When the derivative is positive we want to decrease w, and when the derivative is
negative, we wanna increase w.

The update of the hill descent algorithm is gonna look almost exactly the same as the hill
climbing, except we have a minus sign, so we're going to move in the opposite direction:

$w^{(t+1)} = w^{(t)} - \eta \frac{dg}{dw}$

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/zVcGn/finding-the-min-via-hill-descent) 2:40

5) Choosing stepsize and convergence criteria

So in these algorithms, I said that there's a stepsize, denoted as $\eta$. This
determines how much you're changing your w at every iteration of this algorithm.

One choice you can look at is something called fixed stepsize or constant stepsize, where,
as the name implies, you just set eta equal to some constant. So for example, maybe 0.1.

But what can happen is that you can jump around quite a lot. I keep taking these big steps.
And I end up jumping over the optimal to a point here and then I jump back and then I'm
going back and forth, and back and forth. And I converge very slowly to the optimal itself.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/3UvFZ/choosing-stepsize-and-convergence-criteria) 2:00

A common choice is to decrease the stepsize as the number of iterations increase.

One thing you have to be careful about is not decreasing the stepsize too rapidly. Because if
you're doing that, you're gonna, again, take a while to converge. Because you're just gonna
be taking very, very, very small steps. Okay, so in summary choosing your stepsize is just a
bit of an art.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/3UvFZ/choosing-stepsize-and-convergence-criteria) 4:30

How are we going to assess our convergence?

Well, we know that the optimum occurs when the derivative of the function is equal to 0.

But what we're gonna see in practice, is that the derivative, it's gonna get smaller and
smaller, but it won't be exactly 0. At some point, we're going to want to say, okay, that's good
enough. We're close enough to the optimum. I'm gonna terminate this algorithm.

In practice, stop when

$\left| \frac{dg}{dw} \right| < \epsilon$

$\epsilon$ is a threshold I'm setting. Then if this is satisfied, I'm gonna terminate the algorithm
and return whatever solution I have, $\hat{w}$. So in practice, we're just gonna choose epsilon to
be very small. And I wanna emphasize that what very small means depends on the data
that you're looking at and what the form of this function is.
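The stopping rule can be sketched with a constant stepsize. The convex function here is assumed, g(w) = (w - 3)^2, purely to illustrate the |dg/dw| < epsilon test:

```python
def dg(w):
    # Derivative of an assumed convex g(w) = (w - 3)**2.
    return 2.0 * (w - 3.0)

def hill_descent(w, eta=0.1, eps=1e-6, max_iter=10000):
    """Hill descent that terminates once |dg/dw| < eps."""
    for t in range(max_iter):
        d = dg(w)
        if abs(d) < eps:       # "close enough to the optimum": stop
            return w, t
        w = w - eta * d        # move against the derivative
    return w, max_iter

w_min, iters = hill_descent(0.0)
print(round(w_min, 4), iters < 10000)  # → 3.0 True
```

Shrinking eps buys a solution closer to the exact minimum at the cost of more iterations, which is the trade-off described above.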

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/3UvFZ/choosing-stepsize-and-convergence-criteria) 5:30

4) An aside on optimization: multidimensional objectives

1) Gradients: derivatives in multiple dimensions

So up until this point we've talked about functions of just one variable and finding their
minimum or maximum. But remember, when we were talking about residual sums of
squares, we had two variables, two parameters of our model, w0 and w1. And we wanted
to minimize over both.

Let's talk about how we're going to handle functions defined over multiple variables. Moving
to multiple dimensions here, when we have these functions in higher dimensions we
don't talk about derivatives any more; we talk about gradients in their place.

What is a gradient?

$\nabla g(w)$: notation for the gradient of a function

w: vector of different w's

The definition of a gradient: it's gonna be a vector of what are
called the partial derivatives of g. We're going to look at the partial of g with respect to $w_0$,
the partial of g with respect to $w_1$, all the way up to the partial of g with respect
to some $w_p$:

$\nabla g(w) = \begin{bmatrix} \frac{\partial g}{\partial w_0} & \frac{\partial g}{\partial w_1} & \cdots & \frac{\partial g}{\partial w_p} \end{bmatrix}^T$

It's exactly like a derivative where we're taking the derivative with respect to, in this case, $w_1$.
But what are we going to do with all the other w's: $w_0$, $w_2$, $w_3$, all the way up
to $w_p$? Well, we're just going to treat them like constants.

So the gradient is a (p + 1)-dimensional vector, because we're indexing starting at zero.
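The "treat the other w's like constants" idea is exactly what a numerical partial derivative does: nudge one coordinate, hold the rest fixed. A sketch with an assumed polynomial (hypothetical, not the lecture's example):

```python
def g(w):
    # Assumed example function: g(w0, w1) = 5*w0 + 10*w0*w1 + 2*w1**2.
    w0, w1 = w
    return 5 * w0 + 10 * w0 * w1 + 2 * w1 ** 2

def numerical_gradient(f, w, h=1e-6):
    """Central-difference gradient: perturb one coordinate at a time,
    holding every other coordinate constant."""
    grad = []
    for j in range(len(w)):
        up = list(w); up[j] += h
        dn = list(w); dn[j] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad

# Analytically: dg/dw0 = 5 + 10*w1 and dg/dw1 = 10*w0 + 4*w1,
# so at (1, 2) the gradient is [25, 18].
print(numerical_gradient(g, [1.0, 2.0]))  # ≈ [25.0, 18.0]
```

This kind of finite-difference check is also a handy way to verify a hand-derived gradient like the RSS gradient later in these notes.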

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/ZwU5b/gradients-derivatives-in-multiple-dimensions) 3:05

Work through Example

We have a function:

Let's compute the gradient of g with respect to w:

partial of g with respect to $w_0$: in this case we treat $w_1$ like a constant.
partial of g with respect to $w_1$: $w_0$ is a constant in this case.

This is my gradient:

If I want to look at the gradient at any point on this surface, I'm just going to plug in
whatever the $w_1$ and $w_0$ values are at that point, and I'm going to compute the
gradient. And it'll be some vector: some number in the first component, some
number in the second component, and that forms some vector.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/ZwU5b/gradients-derivatives-in-multiple-dimensions) 5:08

2) Gradient descent: multidimensional hill descent

Instead of looking at these 3D mesh plots that we've been looking at, we can look at a
contour plot, where we can kind of think of this as a bird's eye view.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/6PJ3h/gradient-descent-multidimensional-hill-descent) 2:37

Let's talk about the gradient descent algorithm, which is the analogous algorithm to what I
called the hill descent algorithm in 1D.

But in place of the derivative of the function, we've now specified the gradient of the
function. And other than that, everything looks exactly the same. So what we're doing is,
we now have a vector of parameters, and we're updating them all at once:
we're taking our previous vector and updating with minus eta times our gradient,
which is also a vector. So it's just the vector analog of the hill descent algorithm:

$w^{(t+1)} = w^{(t)} - \eta \nabla g(w^{(t)})$

If we take a look at the picture, we start at a point where the gradient is actually pointing in
the direction of steepest ascent (uphill). But we're moving in the negative gradient direction.
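A minimal sketch of this vector update, with an assumed bowl-shaped objective (hypothetical, chosen so its minimum sits at (1, -2)):

```python
def grad_g(w):
    # Gradient of an assumed convex bowl g(w) = (w0 - 1)**2 + (w1 + 2)**2.
    w0, w1 = w
    return [2 * (w0 - 1.0), 2 * (w1 + 2.0)]

def gradient_descent(w, eta=0.1, eps=1e-8, max_iter=10000):
    """Update the whole parameter vector at once: w <- w - eta * grad(w)."""
    for _ in range(max_iter):
        g = grad_g(w)
        if max(abs(gj) for gj in g) < eps:   # gradient ~ 0: converged
            break
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

w_opt = gradient_descent([0.0, 0.0])
print([round(wj, 4) for wj in w_opt])  # → [1.0, -2.0]
```

Every coordinate moves simultaneously, each against its own partial derivative, which is what distinguishes this from running the 1D algorithm coordinate by coordinate.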

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/6PJ3h/gradient-descent-multidimensional-hill-descent) 5:30

5) Finding the least squares line

1) Computing the gradient of RSS

Now, we can think about applying these optimization notions and optimization algorithms
that we described to our specific scenario of interest. Which is searching over all possible
lines and finding the line that best fits our data.

So the first thing that's important to mention is the fact that our objective is convex. And
what this implies is that the solution to this minimization problem is unique. We know
there's a unique minimum to this function. And likewise, we know that our gradient descent
algorithm will converge to this minimum.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/PRcIZ/computing-the-gradient-of-rss) 1:20

Let's return to the definition of our cost, which is the residual sum of squares of our two
parameters, (w0, w1), which I've written again right here.

But before we get to taking the gradient, which is gonna be so crucial for our gradient
descent algorithm, let's just note the following fact about derivatives: if we take the
derivative of a sum of N different functions, g1 to gN, over some variable w, the derivative
distributes across the sum, and we can rewrite it as the sum of the derivatives. So the
derivative of the sum of functions is the same as the sum of the derivatives of the
individual functions:

$\frac{d}{dw} \sum_{i=1}^{N} g_i(w) = \sum_{i=1}^{N} \frac{d}{dw} g_i(w)$

The $g_i$ that I'm writing here is equal to:

$g_i(w_0, w_1) = (y_i - (w_0 + w_1 x_i))^2$

And we see that the residual sum of squares is indeed a sum over N different functions of
w0 and w1. And so in our case, when we're thinking about taking the partial of the residual
sum of squares with respect to, for example, w0, this is going to be equal to:

$\frac{\partial RSS}{\partial w_0} = \sum_{i=1}^{N} \frac{\partial}{\partial w_0} (y_i - (w_0 + w_1 x_i))^2$

And the same holds for w1.
And the same holds for W1.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/PRcIZ/computing-the-gradient-of-rss) 3:45

Let's go ahead and actually compute this gradient.

The first thing we're gonna do is take the partial with respect to w0.

Okay, so I'm gonna use the fact that I showed on the previous slide to take the sum to the
outside, and then I'm gonna take the partial of the inside. So, I have a function raised to a
power, so I'm gonna bring that power down: I get a 2 here, rewrite the function, that's
$y_i - (w_0 + w_1 x_i)$, and now the power is just gonna be 1. But then, I have to take the
derivative of the inside. And so what's the derivative of the inside when I'm taking this
derivative with respect to w0? Well, what I have is a -1 multiplying w0, and everything else
I'm just treating as a constant. So, I need to multiply by -1:

$\frac{\partial RSS}{\partial w_0} = -2 \sum_{i=1}^{N} (y_i - (w_0 + w_1 x_i))$

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/PRcIZ/computing-the-gradient-of-rss) 4:58

Let's go ahead and now take the partial with respect to w1.

In this case, again, I'm pulling the sum out, and the same thing happens where I'm gonna
bring the 2 down, rewrite the inside part of the function, and raise it just to the 1 power.
And then, when I take the derivative of this inside part with respect to w1, what do I have?
Well, all of these things are constants with respect to w1, but what's multiplying w1? I have
a $(-x_i)$, so I'm going to need to multiply by $(-x_i)$:

$\frac{\partial RSS}{\partial w_1} = -2 \sum_{i=1}^{N} (y_i - (w_0 + w_1 x_i)) x_i$

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/PRcIZ/computing-the-gradient-of-rss) 5:58

Let's put this all together:

$\nabla RSS(w_0, w_1) = \begin{bmatrix} -2 \sum_{i=1}^{N} (y_i - (w_0 + w_1 x_i)) \\ -2 \sum_{i=1}^{N} (y_i - (w_0 + w_1 x_i)) x_i \end{bmatrix}$

So this is the gradient of our residual sum of squares, and it's a vector of two dimensions
because we have two variables, w0 and w1. Now what can we think about doing? Well, of
course we can think about doing the gradient descent algorithm. But let's hold off on that,
because what do we know is another way to solve for the minimum of this function?

Well, we know we can, just like we talked about in 1D, take the derivative and set it
equal to zero; that was the first approach for solving for the minimum. Here we can take
the gradient and set it equal to zero.

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/PRcIZ/computing-the-gradient-of-rss) 7:13

2) Approach 1: closed-form solution

So this is gonna be our Approach 1. And this is drawn here on this 3D mesh plot, where that
green surface is the gradient at the minimum. And what we see is that's where the gradient
= 0. And that red dot is the optimal point that we're gonna be looking at.

Let's take this gradient, set it equal to zero and solve for W0 and W1. Those are gonna be
our estimates of our two parameters of our model that define our fitted line.

I'm gonna take the top line and set it equal to 0. The reason I'm putting the hats on now is
that these are our solutions; these are our estimated values of these parameters.

Break down the math

https://www.coursera.org/learn/ml-regression/lecture/G9oBu/approach-1-closed-form-solution/discussions/Ywv6RZfxEeWNbBIwwhtGwQ

First, let's make our notation shorter:

$\sum\limits_{i=1}^N=\Sigma$

Top term (Row 1)

Set the first component of the gradient to zero: $-2\Sigma(y_i - (w_0 + w_1 x_i)) = 0$

1) Break summation down: $-2(\Sigma y_i - \Sigma w_0 - w_1 \Sigma x_i) = 0$

2) Divide both sides by -2: $\Sigma y_i - \Sigma w_0 - w_1 \Sigma x_i = 0$

3) Summation of a constant replacement ($\Sigma w_0 = N w_0$): $\Sigma y_i - N w_0 - w_1 \Sigma x_i = 0$

4) Solve for $w_0$: $N w_0 = \Sigma y_i - w_1 \Sigma x_i$

5) Divide by N: $w_0 = \frac{\Sigma y_i - w_1 \Sigma x_i}{N}$

6) Move constant out: $w_0 = \frac{\Sigma y_i}{N} - w_1 \frac{\Sigma x_i}{N}$

7) Put party hats on $w_0$ and $w_1$: $\hat{w}_0 = \frac{\Sigma y_i}{N} - \hat{w}_1 \frac{\Sigma x_i}{N}$

Bottom term (Row 2)

Set the second component of the gradient to zero: $-2\Sigma(y_i - (w_0 + w_1 x_i)) x_i = 0$

1) Factor in $x_i$: $-2\Sigma(x_i y_i - w_0 x_i - w_1 x_i^2) = 0$

2) Break summation down: $-2(\Sigma x_i y_i - \Sigma w_0 x_i - \Sigma w_1 x_i^2) = 0$

3) Divide both sides by -2: $\Sigma x_i y_i - \Sigma w_0 x_i - \Sigma w_1 x_i^2 = 0$

4) Move constants $w_0$ and $w_1$ out: $\Sigma x_i y_i - w_0 \Sigma x_i - w_1 \Sigma x_i^2 = 0$

5) Replace $w_0$ with equation from Row 1, Line 6: $\Sigma x_i y_i - \left(\frac{\Sigma y_i}{N} - w_1 \frac{\Sigma x_i}{N}\right) \Sigma x_i - w_1 \Sigma x_i^2 = 0$

6) Multiply out, keeping the denominator of N: $\Sigma x_i y_i - \frac{\Sigma y_i \Sigma x_i}{N} + w_1 \frac{(\Sigma x_i)^2}{N} - w_1 \Sigma x_i^2 = 0$

7) Multiply by N to remove the denominator: $N \Sigma x_i y_i - \Sigma y_i \Sigma x_i + w_1 (\Sigma x_i)^2 - w_1 N \Sigma x_i^2 = 0$

8) Group terms: $N \Sigma x_i y_i - \Sigma y_i \Sigma x_i = w_1 \left(N \Sigma x_i^2 - (\Sigma x_i)^2\right)$

9) Solve for $w_1$: $w_1 = \frac{N \Sigma x_i y_i - \Sigma y_i \Sigma x_i}{N \Sigma x_i^2 - (\Sigma x_i)^2}$

10) Divide top and bottom by N: $w_1 = \frac{\Sigma x_i y_i - \frac{\Sigma y_i \Sigma x_i}{N}}{\Sigma x_i^2 - \frac{(\Sigma x_i)^2}{N}}$

11) Put party hat on $w_1$: $\hat{w}_1 = \frac{\Sigma x_i y_i - \frac{\Sigma y_i \Sigma x_i}{N}}{\Sigma x_i^2 - \frac{(\Sigma x_i)^2}{N}}$
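The two closed-form estimates at the end of this derivation can be coded up directly. The data below is hypothetical, chosen to lie exactly on y = 2 + 3x so the fit is easy to check by eye:

```python
def fit_closed_form(xs, ys):
    """Closed-form least squares estimates for y = w0 + w1 * x:
    w1_hat = (sum(x*y) - sum(y)*sum(x)/N) / (sum(x^2) - sum(x)^2/N)
    w0_hat = sum(y)/N - w1_hat * sum(x)/N
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (sxy - sy * sx / n) / (sxx - sx * sx / n)
    w0 = sy / n - w1 * sx / n
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]   # exactly y = 2 + 3x
print(fit_closed_form(xs, ys))  # → (2.0, 3.0)
```

No iteration is needed here; this is the "set the gradient to zero and solve" approach in full.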

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/G9oBu/approach-1-closed-form-solution) 5:11

3) Approach 2: gradient descent

The other approach we discussed is gradient descent, where we're
walking down this surface of residual sum of squares trying to get to the minimum. Of
course we might overshoot it and go back and forth, but that's the general idea of
this iterative procedure. And in this case it's useful to reinterpret the gradient of the
residual sum of squares that we computed previously.

A couple of notes on notation:

$y_i$: actual house sales observation
$\hat{y}_i(w_0, w_1)$: predicted value, written as a function of $w_0$ and $w_1$ to
make it clear that it's the prediction I'm forming when using $w_0$ and $w_1$, so:

$\hat{y}_i(w_0, w_1) = w_0 + w_1 x_i$
Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/Ifx9C/approach-2-gradient-descent) 1:30

Then we can write our gradient descent algorithm:

While not converged, we're gonna take our previous vector of (w0 at iteration t, w1 at
iteration t) and subtract eta times the gradient:

$\begin{bmatrix} w_0 \\ w_1 \end{bmatrix}^{(t+1)} = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}^{(t)} - \eta \nabla RSS\left(w_0^{(t)}, w_1^{(t)}\right)$

Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/Ifx9C/approach-2-gradient-descent) 3:25

We can also rewrite the algorithm.

I want it in this form to provide a little bit of intuition here. Because what happens if overall,
https://github.com/tuanavu/coursera-university-of-washington/blob/master/machine_learning/2_regression/lecture/week1/Linear Regression.ipynb 29/33
7/30/2019 coursera-university-of-washington/Linear Regression.ipynb at master · tuanavu/coursera-university-of-washington

we just tend to be underestimating our values y?

So, if overall, we're under predicting , then we're gonna have that the sum of yi- y hat i is
going to be positive ( is positive). Because we're saying that is always below,
or in general, below the true value , so this is going to be positive.

And what's gonna happen? Well, this term here is positive. We're multiplying by
a positive thing, and adding that to our vector W. So is going to increase. And that
makes sense, because we have some current estimate of our regression fit. But if generally
we're under predicting our observations that means probably that line is too low. So, we
wanna shift it up. That means increasing .

So, there's a lot of intuition in this formula for what's going on in the gradient descent algorithm. And that's just talking about the first term, for $w_0$; there's also a second term, for $w_1$, which is the slope of the line. In that case there's a similar intuition, but we need to multiply by $x_i$, accounting for the fact that this is a slope term.

*Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/Ifx9C/approach-2-gradient-descent) 6:00
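The intuition above can be checked numerically. In this sketch (with made-up data and parameters), the current line under-predicts every observation, so the intercept's update term is positive and a gradient step pushes $w_0$ up:

```python
# Numeric check of the intuition: if the current line under-predicts every
# observation, the term sum(y_i - yhat_i) in w0's update is positive, so a
# gradient step increases w0. Data and parameters are illustrative.

x = [1.0, 2.0, 3.0]
y = [5.0, 7.0, 9.0]          # true line: y = 3 + 2x
w0, w1 = 0.0, 2.0            # intercept too low -> under-predicting

errors = [yi - (w0 + w1 * xi) for xi, yi in zip(x, y)]
update_term = sum(errors)    # this appears in w0's gradient step
print(update_term)           # 9.0 -> positive, so w0 moves up
```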

4) Comparing the approaches



Let's take a moment to compare the two approaches that we've gone over, either setting the
gradient equal to zero or doing gradient descent.

Well, in the case of minimizing residual sum of squares, we showed that both were fairly straightforward to do. But for a lot of the machine learning methods we're interested in, there's just no closed-form solution to setting the gradient equal to zero. So, often we have to turn to methods like gradient descent.

And likewise, as we're gonna see in the next module, where we turn to having lots of different inputs, lots of different features in our regression: even though there might be a closed-form solution to setting the gradient equal to zero, sometimes in practice it can be much more computationally efficient to implement the gradient descent approach.
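As a sketch of the first approach, the closed-form solution for simple linear regression comes from setting both partial derivatives of RSS to zero and solving for $w_0$ and $w_1$. The data values below are made up for illustration:

```python
# Closed-form solution for simple linear regression: set the gradient of
# RSS to zero and solve. Data values are illustrative.

def closed_form(x, y):
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope from the normal equations: covariance(x, y) / variance(x);
    # intercept from the means.
    num = sum(xi * yi for xi, yi in zip(x, y)) - n * x_mean * y_mean
    den = sum(xi * xi for xi in x) - n * x_mean * x_mean
    w1 = num / den
    w0 = y_mean - w1 * x_mean
    return w0, w1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
w0, w1 = closed_form(x, y)
print(w0, w1)  # roughly 0.15 and 1.95
```

Unlike gradient descent, this gives the exact minimizer in one shot, with no step size to tune; the trade-off discussed above only appears once there are many features.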

*Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/5oBEV/comparing-the-approaches) 1:20

6) Discussion and summary of simple linear regression

1) Asymmetric cost functions

Let's discuss the intuition of what happens if we use a different measure of error.

Okay, so this residual sum of squares that we've been looking at is something that's called a symmetric cost function. And that's because what we're assuming when we look at this error metric is that if we overestimate the value of our house, that has the same cost as if we underestimate the value of the house.

*Screenshot taken from Coursera (https://www.coursera.org/learn/ml-regression/lecture/JdRnN/asymmetric-cost-functions) 2:41

What happens if the costs of these errors are not symmetric?

But what if the cost of listing my house sales price too high is bigger than the cost of listing it too low?

If I list the price too high, there will be no offers.


If I list the price too low, I still get offers, but not as high as I could have if I had more accurately estimated the value of the house.

So in this case it might be more appropriate to use an asymmetric cost function, where the two types of mistakes are not weighed equally. If we choose an asymmetric cost here, we prefer to underestimate the value rather than overestimate it.
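A minimal sketch of the difference: both cost functions below penalize an error $e = y - \hat{y}$, but the asymmetric one charges more for over-predicting. The 4x penalty factor is an arbitrary illustrative choice, not a value from the lecture:

```python
# Symmetric vs asymmetric cost for an error e = actual - predicted.
# Negative e means we over-predicted (listed the price too high); the
# asymmetric cost penalizes that case more. The 4x factor is illustrative.

def symmetric_cost(e):
    return e ** 2

def asymmetric_cost(e, over_penalty=4.0):
    return over_penalty * e ** 2 if e < 0 else e ** 2

print(symmetric_cost(10), symmetric_cost(-10))    # 100 100 -> same cost
print(asymmetric_cost(10), asymmetric_cost(-10))  # 100 400.0 -> over-predicting costs more
```

Minimizing such a cost pulls the fitted line downward relative to the least-squares fit, which matches the intuition of preferring to underestimate.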

