
Linear Regression From Scratch in Python (Mathematical)


What is going on guys, welcome back. In today's video we're going to implement linear regression from scratch in Python, and a warning right ahead: it's going to be mathematical. So let's get right into it.

All right, so let's get started with the very basics of linear regression: what is linear regression and what can it be useful for? We're going to start with an example right away. Let's say we have a bunch of students, these students have an exam, and each of them has a certain study time that they invested for this exam. So let's say we have the study time here; it can go from zero hours up until, not really infinity, but pretty much infinity, depending on what the maximum is among the students. You can study two weeks in advance, one year in advance, one minute in advance, so from zero up until, let's say, limitless. And then we have an exam score, and the exam score is the actual result of the test; it can go from zero points to a hundred points, or from zero percent to a hundred percent.

So what we can do now is: let's say we have a data set with these data points and we want to plot it on a two-dimensional coordinate system. That's not the most beautiful coordinate system, I know, but let's say the x-axis is the study time and the y-axis is the exam score.

Now if we were to visualize all these points, we would probably see something like that, with a bunch of data points here: people who don't study at all and get pretty low scores, people who study a little bit and get better scores but still some that get pretty bad scores, and so on. And we can see all these data points: people that study a lot and get a very high score, then some people who study a lot and don't get very good grades at all, and some people who don't study at all and get very good grades. Those are the outliers, but most of the points are going to be somewhere here in the middle. Now, what linear regression tries to do is find a line that fits these points the best. "Best" is a little bit subjective, because certain regression types or certain algorithms use different approaches: some say "I don't care about outliers", some care a lot about outliers, and so on. Depending on what you want, you have to choose a different procedure. But for linear regression, what we're interested in is minimizing the error.

And the error is basically this: let's say I have a linear function here, a blue line like that, and this is obviously not the best function that we can find for these data points, but to get the error of that function we go to each point and see: if this function were correct, we would predict that for this x value, this would be the y value. Then we just go down and see, okay, this here is the error, and this here is the error, and this here is the error. It's just the difference between the prediction and the actual reality; that is the error. And what we want to do is find a function that minimizes that error; this is what linear regression is about.

So the structure of a linear function is basically y equals m times x plus b, where m is basically the steepness (the slope) and b is the offset here (the intercept). We expect the final line to be something like that, pretty much fitting the points, maybe influenced a little bit by the outliers and so on, but all in all we want it to fit most points; we want to minimize the error in total.

And for this, of course, we need to minimize the error function. The actual minimization process is a little bit more complicated, but the error function in and of itself needs to be defined first, so that we can minimize it. Because in order to minimize something with the gradient descent algorithm, which we're going to talk about in a second, it has to be something that produces a value, since we want to minimize the output by tweaking the little things that we can tweak. In our case, if we look at m times x plus b, we want to manipulate, we want to tweak, m and b, so that for each x we get the best possible y with the least error.

And the error function needs to be defined in order to minimize it. So the error function is going to be E, capital E, equal to 1 divided by n (I'm going to tell you in a second why it's 1 divided by n) times the sum from i equals 0 up until n, and inside the sum we take y i, which is the actual value that we get here; so this point is y i, and this point is y i, the y value of this point is the actual y i value. From this we subtract y i hat, which is just the predicted value, this value here. And we can actually remove that and replace it by m times x i plus b. So that is the error, we can square it, and that would be the mean squared error, and we can add an i here to make it a little bit more accurate. If you have never seen this before, this is called the mean squared error function.
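Written out in one place, the error function being described is the mean squared error over the n data points (indexed here from 0 to n-1, which is what Python's range(n) will iterate over later):

    E(m, b) = \frac{1}{n} \sum_{i=0}^{n-1} \left( y_i - (m x_i + b) \right)^2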

Don't be confused, it's not really complicated. It may look complicated because it's math, and for a lot of people math just looks complicated, but all we're doing here is writing, in a mathematical way, that for each point (and n is the number of points), from the first point to the last point, we take the y value of the actual point, so this y value here, or this y value here, the actual value, and we subtract the y value the function would give us at that position. So if we have a function like this, we take the actual y value here and this y value here, and this is the difference; this is what we're going to get. Then we square that difference, and in the end we take all these distances, all these arrows that we have, sum them up and divide by n, the number of points, so we get the mean squared error. So this is what this function does: it's just a fancy way of saying take the difference between each actual point and what the function would predict the proper y value to be, square that difference, sum over all points and divide by the number of points, to get the mean squared error. This is the error function that we're trying to minimize when it comes to linear regression.

All right, so this next part here is going to be very technical and mathematical, because we're going to talk about partial derivatives, about calculus. So if you're not interested in that, or if you don't understand calculus at all, you can skip to the coding part directly. I would not recommend it, because even if you don't understand everything, it's good to at least listen to what is happening behind the scenes. So I would recommend you to watch that part as well, but it may be a little bit boring because it's very technical and mathematical. I personally think it's one of the most exciting things, to understand exactly what is happening, how this optimization works, but I can understand if some of you say "I don't want to listen to that, I want to get to the coding." Just keep in mind that you won't fully understand what is happening behind the scenes.

So let's get into it. We want to minimize that error function; we want to find the line that gives us the lowest possible E.

The only things that we can influence are m and b; the x is just the input and the y is just the output. We want to find m and b so that we minimize E, that is our goal. And how can we do that? We can do that by taking the partial derivative with respect to m and with respect to b, because that gives us the direction of the steepest ascent with respect to m and b: how can we change m to maximally increase E, and how can we change b to maximally increase E? Now you might say, okay, didn't we want to decrease E? Yes, we did want to decrease E, but if you have the direction of the steepest ascent, you can just go in the opposite direction and you have the direction of the steepest descent. So this is what we're going to do: we're going to take the partial derivatives with respect to m and b, and then we're just going to go in the opposite direction of this gradient.

So we're going to say the partial derivative of E with respect to m is 1 divided by n times the sum from i equals 0 to n of 2 times (y i minus (m times x i plus b)); those are just the basic calculus rules, so if you don't know calculus, don't be confused, you don't need to understand everything. Now we also need to multiply by the inner derivative: the sum we can basically leave alone, and the inner derivative of that term with respect to m gives the factor negative x i, so that is what we end up with. And now we can simplify by pulling out the two and the negative sign, so this is nothing but minus 2 divided by n times the sum from i equals 0 up until n of x i times (y i minus (m times x i plus b)). That is the partial derivative with respect to m.

Now let's do the same thing for b. It's basically the same, but the difference is that we don't have the x i, because here the inner derivative with respect to b is just 1, there is no extra factor. So we get negative 2 divided by n times the sum from i to n of (y i minus (m times x i plus b)).
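Collecting the two results from the whiteboard in LaTeX form (again indexing the points from 0 to n-1):

    \frac{\partial E}{\partial m} = -\frac{2}{n} \sum_{i=0}^{n-1} x_i \left( y_i - (m x_i + b) \right)

    \frac{\partial E}{\partial b} = -\frac{2}{n} \sum_{i=0}^{n-1} \left( y_i - (m x_i + b) \right)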

That is it. Those two things give us the direction of the steepest ascent with respect to m and b, and all we need to do now is go in the opposite direction. That's all we need to do. So if we want to improve m and b, with each iteration we take the new m and assign to it the current m minus a learning rate (we're going to talk about that in a second) times the direction of the steepest ascent, so basically the partial derivative of E with respect to m. And the same for b: b equals b minus L times the partial derivative of E with respect to b. This is more of a programming notation than a mathematical notation, but that is what we do.

So we know, okay, in this direction we increase the error the most, so we just go in the opposite direction, and we do that with each iteration, because the gradient changes, right? Sometimes the steepest descent is in this direction, then in another direction, then again in another direction, especially if we don't just deal with two variables but with many different variables. We don't do this only with score and study time; sometimes we have 10, 20, a thousand different features that we have to take into account when doing a linear regression, so it's not always just two things.

And again, this is the direction of the steepest ascent, which is why we subtract: we don't want to go in that direction, we want to go in the opposite direction, otherwise we would have a plus here and not a minus. And the learning rate is basically how large the steps are that we take. Now, the larger the steps, the faster we're going to get towards the optimum, but the lower the learning rate, the better the result is going to be in the end, because we're paying much more attention to details. So I think we're going to go with a learning rate of about 0.001, or maybe one more zero. We're going to do that, and now that we have all the math handled, we're going to do this in Python.
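In symbols, the update applied in every iteration, with learning rate L, is:

    m \leftarrow m - L \cdot \frac{\partial E}{\partial m}, \qquad b \leftarrow b - L \cdot \frac{\partial E}{\partial b}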

Now we're going to take all this theory and turn it into Python code, and for this we're going to start by installing two libraries. However, since we're implementing linear regression from scratch, those two libraries are not related to the linear regression algorithm itself: we're just going to use pandas to load a data set from a CSV file, and we're going to use matplotlib for visualization. The whole linear regression process will be implemented by us from scratch. If you don't have pandas and matplotlib, you want to go to cmd and say pip install pandas and pip install matplotlib. Once you have that, you can just go ahead and say import pandas as pd and import matplotlib.pyplot as plt.

Now, what you're going to need is some sort of data set. You can craft your own, you can just make up some values, or, if you're comfortable with that, take an actual data set and apply linear regression to it. For this video I have just crafted my own sample CSV file, just some random values that have a certain trend inside of them: we have x, y and then a bunch of randomly generated values. You can, as I said, go with a real data set, and we can also interpret those columns as study time and score if you want to. I think we have some values above 100, as you can see here, 117, so it's not entirely accurate, but we can just go ahead and call the columns studytime and score. So this is just a basic comma-separated values file, a CSV file, with some random values. We can load them in by saying data = pd.read_csv("data.csv"), and then we can take a look at it by saying print(data) to see the structure of the data frame. Nothing too fancy: two columns, studytime and score, and the values. We can also go ahead and visualize them by saying plt.scatter(data.studytime, data.score), then plt.show(), like that.
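As a sketch, the setup described so far might look like this, assuming a data.csv file whose two columns are named studytime and score:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Load the sample data set (two columns: studytime, score)
    data = pd.read_csv("data.csv")
    print(data)

    # Quick look at the raw points
    plt.scatter(data.studytime, data.score)
    plt.show()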

And there you go, those are the data points that we're going to use for this regression example. You can actually pick any data points; the data is not the focus here, we just want a properly working algorithm. So I'm going to delete the visualization here, and we're going to start with the loss function. We're going to say def loss_function (you can also call this mean_squared_error), and to this function we need to pass the m value, the b value and the points that we have, the actual data points, so in our case data.

What we're going to do here is basically say we have a total_error which starts at zero, we add all the individual squared errors to it, and in the end we divide by the number of points. So we say for i in range(len(points)), then the x value is points.iloc[i].studytime, so at location i we take the study time as the x value, and at location i we take the score as the y value. And the error is basically just what we had already; if we look at my paint here, what was the error function? There you go, this is the error function, and we're just going to write it in a Pythonic way now. So total_error plus-equals, and the loop is basically the sum, and what we write in here is one iteration of that sigma symbol: to the total error we add the actual y value minus what we think the y value should be based on m and b, so m times x plus b, and the whole thing squared, because it's the mean squared error. In the end we just say total_error divided by float(len(points)), like that. So that is the basic loss function.
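Collected into one function, a sketch of what is being typed here might look like this (again assuming the studytime and score column names):

    def loss_function(m, b, points):
        # Mean squared error of the line y = m*x + b over all points
        total_error = 0
        for i in range(len(points)):
            x = points.iloc[i].studytime
            y = points.iloc[i].score
            total_error += (y - (m * x + b)) ** 2
        return total_error / float(len(points))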

It tells us how much we're off from the actual result. Now, what you need to know about this loss function is that we're actually not going to call it, because what we're interested in is just minimizing it, and that is already baked into the gradient descent step: we cannot just hand Python this function and tell it to take the derivative, so we need to do that manually, and we have already done it manually. So this function is more like something you can use if you want to calculate the loss yourself.

But we're actually not going to use it in the final optimization process. So let's go ahead and implement the gradient descent. We're going to call this function gradient_descent, we're going to pass m_now and b_now, the current values, we're going to pass the points, and we want to pass a learning rate L. What we're going to do is start with a gradient for m of 0, and for b as well, and then we say n is just the length of the points, the number of points.

Now we perform the gradient descent. Again, we're not going to use that loss function; it is just a separate function you can use manually. Here we use the formulas that we had, those two partial derivatives that we already talked about, and those already include the loss function to some degree. So we say for i in range(n), so for each point, we take that point: x = points.iloc[i].studytime and y = points.iloc[i].score. Then we calculate the gradient based on the formulas. Again, all I'm doing here is typing those two formulas into these two lines, this negative 2 divided by n, and the loop is the sum symbol; all I'm doing is putting this into Python code, and if you want to understand what's happening here you need to go back to the mathematical explanation. So we say m_gradient plus-equals negative 2 divided by n (the negative sign is out front) times x times (y minus (m at this particular moment in time, times x, plus b at this moment in time)), and then b is basically the same but without the x factor. There you go.

Then, once these iterations are done, we know which direction we have to move in, or away from, actually. So we basically say, okay, the new m is going to be m_now minus the m gradient, so in the opposite direction, but scaled by the learning rate L.
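A sketch of that gradient descent step as narrated, using the two partial derivatives from the math section:

    def gradient_descent(m_now, b_now, points, L):
        # One step of gradient descent on the mean squared error
        m_gradient = 0
        b_gradient = 0
        n = len(points)

        for i in range(n):
            x = points.iloc[i].studytime
            y = points.iloc[i].score
            # Partial derivatives of E with respect to m and b
            m_gradient += -(2 / n) * x * (y - (m_now * x + b_now))
            b_gradient += -(2 / n) * (y - (m_now * x + b_now))

        # Move against the gradient, scaled by the learning rate
        m = m_now - L * m_gradient
        b = b_now - L * b_gradient
        return m, b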

The learning rate determines how much we move; same for b. And that is actually it; in the end, of course, we return m and b, both values. That is the gradient descent function, and we can actually get rid of the loss function if you're not interested in it. That is all we need to do in order to perform the linear regression, so now we just need to execute everything. We say, okay, m starts at zero, b starts at zero (we can pick any starting values if we want to), we will use a learning rate of 0.0001, and we're going to use 100 iterations, also called epochs. We can also go with a thousand, actually, so let's go with a thousand. Now we just say for i in range(epochs), so for that number of iterations we say m and b are going to be gradient_descent of m, b, the data and the learning rate. So we're constantly going to get better and better and better at estimating the best m and b, and in the end we can print m and b.

And we can plot the results: we say plt.scatter(data.studytime, data.score), the color is going to be black, and then plt.plot, and we can plot the regression line here, which is basically just a list. If we look at the CSV file, all the x values at least should be more than 20 and less than 80, so we can just go with list(range(20, 80)), just to have some values, and then we say m times x plus b for x in range(20, 80). There you go, and the color of the regression line will be red. Then plt.show(), and that's it. You can now run this. Maybe you want to print the epoch, so let's say if i modulo 50 equals 0, print the epoch and i, like that, inside of the string maybe. There you go.
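Putting the execution part together, a sketch of the driver code as described (the 20 to 80 plotting range and the epoch printout follow the narration):

    m = 0
    b = 0
    L = 0.0001       # learning rate
    epochs = 1000    # number of gradient descent iterations

    for i in range(epochs):
        if i % 50 == 0:
            print(f"Epoch: {i}")
        m, b = gradient_descent(m, b, data, L)

    print(m, b)

    # Plot the data points and the fitted regression line
    plt.scatter(data.studytime, data.score, color="black")
    plt.plot(list(range(20, 80)), [m * x + b for x in range(20, 80)], color="red")
    plt.show()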

Let's see if that works, or if we made mistakes. Epoch 0, 50, 100... let's just turn the number down so we get the results faster, let's go with 300, and then we should see pretty decent results: 0, 50, 100, and we should see a pretty solid trend line here, or regression line. There you go, seems about right. As you can see, this is a pretty good fit; actually, I think it's close to the optimal linear regression line here, because we had like 300 iterations, and those are the values here for m and b. So this is how you implement linear regression from scratch in Python.

So that's it for today's video. I hope you enjoyed it and learned something; if so, let me know by hitting the like button and leaving a comment in the comment section down below. And of course, don't forget to subscribe to this channel and hit the notification bell to not miss a single future video, for free. Other than that, thank you very much for watching, see you in the next video, and bye.
In this video we implement the linear regression algorithm from scratch. This episode is highly mathematical.

Timestamps:
(0:00) Intro
(0:19) Mathematical Theory
(12:48) Implementation From Scratch
(24:05) Outro


20 COMMENTS

  1. This video is amazing, good job. I'm now actually thinking about revisiting the math classes I couldn't take before in order to get better at these machine learning algorithms.

  2. Awesome tutorial! Could you please explain why you use gradient descent to minimize squared error instead of using the formula: divide the standard deviation of y values by the standard deviation of x values and then multiply this by the correlation between x and y?

  3. I implemented it myself and it came out to be more accurate than scikit-learn

    import numpy as np
    from sympy import Symbol, solve, diff

    class LinearRegression:
        def __init__(self):
            self.w = {}

        def fit(self, x, y):
            for i in np.arange(x.shape[1] + 1):
                self.w[f"w{i}"] = Symbol(f"w{i}")
            e = 0
            for i in range(len(x)):
                yp = 0
                for j in np.arange(len(self.w)):
                    if j == 0:
                        yp += self.w[f"w{j}"]
                    else:
                        yp += self.w[f"w{j}"] * x[i][j-1]
                e += (yp - y[i]) ** 2
            eq = []
            for i in np.arange(len(self.w)):
                eq.append(diff(e, self.w[f"w{i}"]))
            w = solve(eq, list(self.w.keys()))
            for i in np.arange(len(self.w)):
                self.w[f"w{i}"] = w[self.w[f"w{i}"]]

        def predict(self, x):
            def prediction(features):
                yp = 0
                for i in np.arange(len(self.w)):
                    if i == 0:
                        yp += self.w[f"w{i}"]
                    else:
                        yp += self.w[f"w{i}"] * features[i-1]
                return yp
            return list(map(prediction, x))

  4. I am getting an error:

    AttributeError                            Traceback (most recent call last)
    Cell In[10], line 39
         37 if i % 50 == 0:
         38     print(f"Epoch: {i}")
    ---> 39 m, b = gradient_descent(m, b, data, L)
         41 print(m, b)
         43 plt.scatter(data.studytime, data.score, color = "black")

  5. great video NeuralNine
     I feel this is the gradient descent, isn't it?
     If it is, is there an implementation for least squares?
     Because I feel it is just so random: you take some values and you choose the least one from those random values, not necessarily the least value you can get
