What is going on guys, welcome back! In today's video we're going to implement linear regression from scratch in Python, and a warning right ahead: it's going to be mathematical. So let's get right into it.

All right, let's get started with the very basics of linear regression: what is linear regression and what can it be useful for? We're going to start with an example right away. Let's say we have a bunch of students, these students have an exam, and each of them invested a certain amount of study time for this exam. The study time can go from zero hours up to, not really infinity, but practically limitless, depending on the maximum among the students: you can study one minute in advance, two weeks in advance, or one year in advance. Then we have an exam score, which is the actual result of the test, and it can go from zero points to a hundred points, or from zero percent to a hundred percent.

So let's say we have a data set with these data points, and we want to plot it on a two-dimensional coordinate system (not the most beautiful coordinate system, I know), where the x-axis is the study time and the y-axis is the exam score. If we were to visualize all these points, we would probably see something like this: people who don't study at all and get pretty low scores, people who study a little bit and get better scores (but still some that get pretty bad scores), people who study a lot and get a very high score, then some people who study a lot and don't get very good grades at all, and some people who don't study at all and get very good grades. Those last two groups are the outliers, but most of the points are going to be somewhere in the middle. Now, what linear regression tries to do is find a line that fits these points best.

"Best" is a little bit subjective, because different regression types and algorithms use different approaches: some say "I don't care about outliers," some care a lot about outliers, and so on. Depending on what you want, you have to choose a different procedure. For linear regression, what we're interested in is minimizing the error. To see what the error is, let's say I have a linear function here, a blue line, and this is obviously not the best function we can find for these data points. To get the error of that function, we basically go to each point and say: if this function were correct, we would predict that for this x value, this would be the y value. Then we go down to the actual point, and that vertical distance is the error, for this point, and this one, and this one. It is simply the difference between the prediction and the actual reality. What we want to do is find a function that minimizes that error, and this is what linear regression is about.

The structure of a linear function is y = m·x + b, where m is the slope (the steepness) and b is the y-intercept. We expect the final line to fit the points pretty well, maybe influenced a little bit by the outliers, but all in all we want it to fit most points; we want to minimize the error in total. And for this, of course, we need to minimize the error function.

The actual minimization process is a little bit more complicated, but the error function in and of itself needs to be defined first, so that we can minimize it with the gradient descent algorithm that we're going to talk about in a second. In order to minimize something, it has to be something that produces a value, because we want to minimize the output by tweaking the things that we can tweak: in our case, if we look at m·x + b, we want to manipulate m and b so that for each x we get the best possible y, with the least error. So the error function E (capital E) is going to be 1 divided by n (I'm going to tell you in a second why it's 1 divided by n) times the sum over all n points of the squared difference between y_i, the actual y value of the i-th point, and the predicted value ŷ_i, which we can replace by m·x_i + b. That squared difference, averaged over the points, is the mean squared error. If you have never seen this before, this is called the mean squared error function.
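Written out as a formula, summing over all n points (the video's board notation indexes the sum from 0; the usual 1-to-n convention is used here):

```latex
E = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2
```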

Don't be confused: it's not really complicated. It may look complicated because it's math, and for a lot of people math just looks complicated, but all we're doing here is writing mathematically that for each point (and n is the number of points), from the first point to the last point, we take the y value of the actual point, subtract the y value the function predicts at that x, and square that difference. In the end we take all these squared distances and divide by n, the number of points, to get the mean squared error. So this function is just a fancy way of saying: take the difference between every actual point and what the function would predict the proper y value to be, square that difference, and divide by the number of points. This is the error function that we're trying to minimize when it comes to linear regression.
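To make that concrete, here is a tiny worked example with made-up numbers (my own, not from the video): two points (1, 2) and (2, 3), and the line y = x, i.e. m = 1 and b = 0:

```latex
E = \frac{\left(2 - (1 \cdot 1 + 0)\right)^2 + \left(3 - (1 \cdot 2 + 0)\right)^2}{2} = \frac{1^2 + 1^2}{2} = 1
```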

All right, so this next part is going to be very technical and mathematical, because we're going to talk about partial derivatives, about calculus. If you're not interested in that, or if you don't understand calculus at all, you can skip directly to the coding part. I would not recommend it, because even if you don't understand everything, it's good to at least listen to what is happening behind the scenes. So I would recommend you watch this part as well, even though it may be a little bit boring because it's very technical and mathematical. I personally think it's one of the most exciting things to understand what exactly is happening and how this optimization works, but I can understand if some of you say, "I don't want to listen to that, I want to get to the coding." Just keep in mind that you won't fully understand what is happening behind the scenes.

So let's get into it. We want to minimize that error function: we want to find the line that gives us the lowest possible E.

The only things that we can influence are m and b; the x is just the input and the y is just the output. We want to find m and b so that we minimize E. That is our goal. And how can we do that? By taking the partial derivative with respect to m and with respect to b, because that gives us the direction of the steepest ascent with respect to m and b: how can we change m to maximally increase E, and how can we change b to maximally increase E? Now you might say, "didn't we want to decrease E?" Yes, we did, but if you have the direction of the steepest ascent, you can just go in the opposite direction, and you have the direction of the steepest descent.

So this is what we're going to do: take the partial derivatives with respect to m and b, and then go in the opposite direction of this gradient. The partial derivative of E with respect to m starts as 1 divided by n times the sum over the points; the square brings down a factor of 2 in front of (y_i - (m·x_i + b)) (those are just the basic calculus rules, so if you don't know calculus, don't be confused; you don't need to understand everything), and we also need to multiply by the inner derivative. The sum itself we can just pass through, and differentiating the inside with respect to m gives the factor -x_i. Now we can simplify by extracting the 2 and the negative sign, so this is nothing but -2/n times the sum over i of x_i times (y_i - (m·x_i + b)). That is the partial derivative with respect to m. Now let's do the same thing for b. It's basically the same, but the difference is that we don't have the x_i, because differentiating the inside with respect to b gives no extra factor: we get -2/n times the sum over i of (y_i - (m·x_i + b)). That is it. Those two expressions give us the direction of the steepest ascent with respect to m and b, and all we need to do now is go in the opposite direction.
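Collected in one place, the two partial derivatives we just derived are:

```latex
\frac{\partial E}{\partial m} = -\frac{2}{n} \sum_{i=1}^{n} x_i \left( y_i - (m x_i + b) \right),
\qquad
\frac{\partial E}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)
```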

So if we want to improve m and b, with each iteration we take the new m and assign to it the current m minus a learning rate (we're going to talk about that in a second) times the direction of the steepest ascent, the partial derivative of E with respect to m. And the same for b: b gets assigned b minus L times the partial derivative of E with respect to b. This is more programming notation than mathematical notation, but that is what we do.
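With L as the learning rate, the update rules read:

```latex
m \leftarrow m - L \cdot \frac{\partial E}{\partial m},
\qquad
b \leftarrow b - L \cdot \frac{\partial E}{\partial b}
```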

So we know that in a certain direction we increase the error the most, and what we do is simply go in the opposite direction. We do that on each iteration, because the gradient changes: sometimes the steepest descent points one way, then another, then another, especially if we deal not just with two variables but with many. We don't always do this only with score and study time; sometimes there are 10, 20, or a thousand different features to take into account in a linear regression, so it's not always just two things. And again, the gradient is the direction of the steepest ascent, which is why we subtract: we don't want to go in that direction, we want to go in the opposite direction; otherwise we would have a plus here and not a minus. The learning rate is basically how large the steps are that we take. The larger the steps, the faster we approach the optimum, but the lower the learning rate, the better the final result tends to be, because we pay much more attention to the details. So I think we're going to go with a learning rate of about 0.001, or maybe one more zero. Now that we have all the math handled, let's do this in Python.

Now we're going to take all this theory and turn it into Python code, and for this we're going to start by installing two libraries. However, since we're implementing linear regression from scratch, those two libraries are not related to the linear regression algorithm itself: we're just going to use pandas to load a data set from a CSV file, and matplotlib for visualization. The whole linear regression process will be implemented by us from scratch. If you don't have pandas and matplotlib, go to cmd and say pip install pandas and pip install matplotlib. Once you have that, you can just go ahead and say import pandas as pd and import matplotlib.pyplot as plt.
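A minimal sketch of that setup (the install commands belong in your terminal, not in the script):

```python
# In a terminal / cmd:
#   pip install pandas
#   pip install matplotlib

import pandas as pd
import matplotlib.pyplot as plt
```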

Now, what you're going to need is some sort of data set. You can craft your own and just make up some values, or, if you're comfortable with that, take an actual data set and apply linear regression to it. For this video I have just crafted my own sample CSV file: x and y columns with a bunch of randomly generated values that have a certain trend inside them. As I said, you can go with a real data set instead. We can also interpret these columns as study time and score if we want to; admittedly, we have some values above 100, as you can see here (117), so the interpretation isn't entirely accurate, but we can just go ahead and call them studytime and score. So this is just a basic comma-separated-value (CSV) file with some random values. We can load it by saying data = pd.read_csv("data.csv"), and then take a look at it by saying print(data) to see the structure of the data frame: nothing too fancy, two columns, studytime and score, and the values. We can also visualize them by saying plt.scatter(data.studytime, data.score) and then plt.show().
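Here is that loading and plotting step as code; it assumes a file data.csv with two columns named studytime and score, as described above:

```python
# Load the CSV into a pandas DataFrame and inspect its structure
data = pd.read_csv("data.csv")
print(data)

# Scatter plot of the raw data points
plt.scatter(data.studytime, data.score)
plt.show()
```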

And there you go: those are the data points that we're going to use for this regression example. You can pick any data points; the data is not the focus here, we just want a properly working algorithm. So I'm going to delete the visualization, and we're going to start with the loss function. We say def loss_function (you could also call it mean squared error), and to this function we pass the m value, the b value, and the points that we have, the actual data points, so in our case data. What we do here is keep a total error, which starts at zero; we add all the individual squared errors to it, and in the end we divide by the number of points. So we say for i in range(len(points)): the x value is points.iloc[i].studytime, so at location i we want the study time as the x value, and at location i we want the score as the y value. The error is just what we had already: if we look at the error function in my paint sketch, we're simply writing it in a Pythonic way now. The loop plays the role of the sum, and what we write inside it is one term of that sigma: to the total error we add (y - (m * x + b)) squared, the actual y value minus what we thought the y value should be based on m and b, squared because it's the mean squared error. In the end we return total_error divided by float(len(points)). That is the basic loss function; it tells us how much we're off from the actual result.
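Putting that together, a sketch of the loss function as described (again assuming the studytime and score column names):

```python
def loss_function(m, b, points):
    # Mean squared error of the line y = m*x + b over all points
    total_error = 0
    for i in range(len(points)):
        x = points.iloc[i].studytime  # actual x value of the i-th point
        y = points.iloc[i].score      # actual y value of the i-th point
        total_error += (y - (m * x + b)) ** 2  # squared residual
    return total_error / float(len(points))   # mean over all points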

Now, what you need to know about this loss function is that we're actually not going to call it, because what we're interested in is just minimizing it, and that is already baked into the gradient descent: we cannot just hand this function to Python and say "take the derivative of it," so we have to do that manually, and we have already done it. This function is more like something you can use if you want to calculate the loss yourself; we're not going to use it in the final optimization process. So let's go ahead and implement the gradient descent. We call this function gradient_descent, and we pass it m_now and b_now (the current values), the points, and a learning rate L. We start with a gradient of 0 for m and for b as well, and then we say n is just len(points), the number of points. Now we perform the gradient descent; again, we're not going to use that loss function (it's just a separate function you can call manually). Here we're already going to use the formulas we derived,

those two partial derivatives that we already talked about, which include the loss function to some degree. So we say for i in range(n): for each point, we take that point, so x = points.iloc[i].studytime and y = points.iloc[i].score, and then we calculate the gradient based on the formulas. All I'm doing here is typing those two formulas into these two lines, with the -2/n in front and the loop playing the role of the sum symbol; if you want to understand what's happening, go back to the mathematical explanation. So we say m_gradient += -(2/n) times x times (y minus (m at this particular moment in time, times x, plus b at this moment in time)), and b_gradient is basically the same but without the x factor. There you go. Once these iterations are done, we know which direction we have to move in (or away from, actually), so we say the new m is going to be m_now minus m_gradient, so in the opposite direction, scaled by the learning rate L that determines how much we move; same for b. In the end, of course, we return m and b, both values. That is the gradient descent function. (We could actually get rid of the loss function above if you're not interested in it.)
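Here is that gradient descent step as one sketch, following the description above; m_now and b_now are the current parameters and L is the learning rate:

```python
def gradient_descent(m_now, b_now, points, L):
    # Accumulate the partial derivatives of E over all points
    m_gradient = 0
    b_gradient = 0
    n = len(points)

    for i in range(n):
        x = points.iloc[i].studytime
        y = points.iloc[i].score
        # Partial derivatives of the mean squared error (derived earlier)
        m_gradient += -(2 / n) * x * (y - (m_now * x + b_now))
        b_gradient += -(2 / n) * (y - (m_now * x + b_now))

    # Step in the opposite direction of the gradient, scaled by L
    m = m_now - m_gradient * L
    b = b_now - b_gradient * L
    return m, b
```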

That is all we need to do in order to perform the linear regression, so now we just need to execute everything. We say m starts at zero and b starts at zero (we could pick any starting values we want), we use a learning rate of 0.0001, and we use 100 iterations, also called epochs; actually, let's go with a thousand. Now we just say for i in range(epochs): for each iteration, m and b become gradient_descent(m, b, data, L), so we constantly get better and better at estimating the best m and b. In the end we can print m and b and plot the results: plt.scatter(data.studytime, data.score) with color black, and then plt.plot for the trend line, which is basically just a list. Looking at the CSV file, all the x values should be more than 20 and less than 80, so we can just go with list(range(20, 80)) for the x values and [m * x + b for x in range(20, 80)] for the y values. The color of the regression line will be red, and then plt.show(). That's it; you can now run this. Maybe you also want to print the epoch, so let's say if i % 50 == 0: print the epoch number i, inside of an f-string. There you go.
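And the driver code that ties it all together, matching the values used in the video (the 20 to 80 range for the trend line is just what fits this particular data set):

```python
m = 0
b = 0
L = 0.0001
epochs = 1000

for i in range(epochs):
    if i % 50 == 0:
        print(f"Epoch: {i}")
    m, b = gradient_descent(m, b, data, L)

print(m, b)

# Plot the data points and the fitted regression line
plt.scatter(data.studytime, data.score, color="black")
plt.plot(list(range(20, 80)), [m * x + b for x in range(20, 80)], color="red")
plt.show()
```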

Let's see if that works or if we made mistakes: epochs 0, 50, 100... Let's turn the number of epochs down so we get the results faster; let's go with 300. Then we should see pretty decent results: 0, 50, 100, and a pretty solid trend line, or regression line. There you go, seems about right. As you can see, this is a pretty good fit; actually, I think it's close to the optimal linear regression line here, because we had 300 iterations, and those are the resulting values for m and b. So this is how you implement linear regression from scratch in Python.

That's it for today's video. I hope you enjoyed it and learned something. If so, let me know by hitting the like button and leaving a comment in the comment section down below. And of course, don't forget to subscribe to this channel and hit the notification bell to not miss a single future video, for free. Other than that, thank you very much for watching, see you in the next video, and bye!

In this video we implement the linear regression algorithm from scratch. This episode is highly mathematical.

Programming Books & Merch

The Python Bible Book:

The Algorithm Bible Book:

Programming Merch:

Social Media & Contact

Website:

Instagram:

Twitter:

LinkedIn:

GitHub:

Discord:

Outro Music From:

Timestamps:

(0:00) Intro

(0:19) Mathematical Theory

(12:48) Implementation From Scratch

(24:05) Outro

