
10 ML algorithms in 45 minutes | machine learning algorithms for data science | machine learning


If you have an interview coming up and you want to revise the 10 most important machine learning algorithms real quick, you will not find a better video than this. Let's go ahead and revise the 10 most frequently used ML algorithms. These are the 10 algorithms I am going to explain: how they work and what their pros and cons are. As you can see, the first five algorithms are in one color, the next three in a different color, and the last two in yet another color. There is a reason for that, which I will tell you in a moment. But before that, let's answer two basic questions: what is machine learning, and what are algorithms? I'll start with a non-bookish definition and one simple example.

Suppose you want to travel from Bangalore to Hyderabad. You can take a train, take a flight, take a bus, or maybe drive your own car. Two things to understand here: the task and the approach. The task at hand is to go from Bangalore to Hyderabad; the approach is any of the options I just listed. Now relate this to the world of machine learning. In machine learning, the task can be of different kinds: it can be a regression task, a classification task, or an unsupervised learning problem. On the approach side, we can take different approaches depending on whether we are solving a regression problem, a classification problem, or a particular case of unsupervised learning. In regression there is not only one approach; I can take approach one, two, three, four, or five. In classification I can likewise take several approaches, and in unsupervised learning multiple approaches as well. That is why the color coding is there.

The first five algorithms you see here I will explain with a regression use case: we will take a regression problem and understand how to solve it with each of these five algorithms. The next three I will explain with a classification use case, so those approaches are for classification problems. And the last two I will explain with an unsupervised learning problem, showing how those algorithms solve it. So let's go ahead and understand all of this with a simple sample input dataset I have taken here.

Without any delay, let's start with the first algorithm, linear regression. Machine learning is all about learning patterns from data using algorithms. So if we use the algorithm known as linear regression, what will happen? Suppose this is the employee data of an organization: you have an age column and a salary column, so a 22-year-old earns 23,000, and so on. Suppose we use the linear regression approach to solve this regression problem; as I told you, the first five algorithms you will understand through a regression problem.

What linear regression will do is take this data and see how it plots on an XY plane: on the y axis we can take salary, and on the x axis we can take age. Roughly placing the points: the first point, 22 and 23,000, may come somewhere here; the second data point, 41 and 80,000, somewhere here; and the third data point, 58 and 150k, maybe somewhere here.

Linear regression will then try to plot a line. The assumption is that, ideally, all these points should fall on the same line: a line like this can be plotted, or a line like this. In an ideal world all the points would fall on one line, but that never happens with real data. So what linear regression does is fit something known as a best fit line. How is this best fit line computed? It tries to minimize the distances from all the points together, each distance measured parallel to the y axis: call the error for the first point E1, for the second E2, for the third E3. Linear regression will try to minimize E1^2 + E2^2 + E3^2, and whichever line gives the minimum sum of squared errors, it will call that line the model.

Now, as you know from basic mathematics, a straight line has an equation, in its simplest form y = mx + c. In our case I can say salary = m times age plus c, where c is the intercept; let's give it some random number, say 2000. So imagine that this line, which is the model for linear regression, has this formula. The next question: tomorrow, when the pattern has been learned and a new age comes in, say age 50, what will be the salary for that person? Very simple: the model plugs in the numbers. For example, if for m we put some number, say 0.2, then with age 50 and intercept 2000, whatever that calculation comes to is the predicted salary for age 50. A very simple mathematical model. The assumption is that there is a linear relationship between the independent variable and the target variable. So the algorithm plots what it calls a best fit line wherever the sum of squared errors is minimal, and once the best fit line is found, prediction happens as shown.

Obviously every algorithm, every model, has pros and cons. For linear regression, the pros are that it is a simple-to-understand, explainable mathematical model. The con is that your data will not always be simple enough to fall on, or close to, a line; because it is such a simple model, a lot of real world problems may be difficult to solve with plain linear regression. There are varieties of linear regression for which I have created videos you can watch, but plain linear regression works like this. That was our first approach, meaning our first algorithm.
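As a quick illustration, here is a minimal sketch of this idea in Python; scikit-learn and the age-50 query are assumptions for demonstration, using the same three toy data points from the example:

```python
# Minimal linear regression sketch on the toy (age, salary) data above.
from sklearn.linear_model import LinearRegression

X = [[22], [41], [58]]       # age (independent variable)
y = [23000, 80000, 150000]   # salary (target variable)

model = LinearRegression()   # finds the best fit line by minimizing squared errors
model.fit(X, y)

# The learned line: salary = m * age + c
print("slope m:", model.coef_[0], "intercept c:", model.intercept_)

# Predict the salary for a new employee aged 50
print("predicted salary for age 50:", model.predict([[50]])[0])
```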

Now let's go ahead and see how a decision tree approaches the same problem. If you give the same data to a decision tree and ask it to learn the pattern, what the decision tree will do is try to break the data up. How? It creates a rule, for example: age less than or equal to 30. Some records will satisfy this rule and some will not, and that is how the data breaks. Come here: how many records have age less than 30? Only one. More than 30? Two records. So only one record comes to this side. Let me write the numbers correctly: on this side, 22 and 23k; on the other side, 41 and 80k, and one more record, 58 and 150k. Understand this carefully, because this is the base for the algorithms that follow.

So the decision tree splits your data like this: you had three records in total at the beginning, one record on one side and two on the other. That is the first level of the split. The tree can definitely split one more time; here there is only a limited number of records, but imagine more records and there could be one more split, say another rule like age less than 40. I will not take that now, as it would make the tree complex. So this is your model: breaking your data based on some conditions is nothing but your model. If somebody asks what the model is in a decision tree, this is it.

Now the important question: suppose tomorrow somebody asks, for a person with age 50, what is your prediction? This is a very important concept to understand. The decision tree checks where age 50 lands; it falls into this branch, and in this branch there are two records, so the decision tree takes the average of those two salaries. For age 50, the prediction will be (80k + 150k) / 2. That is how the decision tree makes its prediction. Suppose you ask the tree for the salary of a person aged 21: it will go to the left branch, not the right, because that is the branch the record falls into, and it will directly say 23k, since there is only one record there; if there were two records it would take their average. So you see how differently these two approaches solve the same regression problem: in one, a mathematical line is fitted; in the other, the data is broken into multiple pieces and the prediction is made from those pieces. Remember, the decision tree is the basis for many other advanced algorithms.
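Here is a minimal sketch of the same idea in Python (scikit-learn is an assumption for illustration). One caveat: on a dataset this tiny, the library may pick a different split threshold than the hand-drawn age <= 30 rule, since it chooses whichever split reduces squared error most, but the mechanics are identical:

```python
# Minimal decision tree sketch: one split (a stump) on the toy data.
from sklearn.tree import DecisionTreeRegressor, export_text

X = [[22], [41], [58]]
y = [23000, 80000, 150000]

tree = DecisionTreeRegressor(max_depth=1)  # depth 1 = a single yes/no rule
tree.fit(X, y)

print(export_text(tree, feature_names=["age"]))  # show the learned rule
# Each leaf predicts the average salary of the records that fall into it
print("prediction for age 50:", tree.predict([[50]])[0])
```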

Our third algorithm in the list is known as random forest. But first, we did not discuss the pros and cons of the decision tree. The pro is that it is a simple model; you don't need to do a lot of mathematics. The con is that there is a chance of overfitting: if there is a little change in the data, your model may change totally. That is the risk with a decision tree.

So random forest comes and says: decision tree, you have done a good job, you are taking the right approach, but there is a chance of overfitting, so why don't you fit multiple trees? What random forest does is create multiple trees: this is tree one, like the decision tree we just saw; this is tree two; and similarly there can be n trees. We will call them T1, T2, and so on; there can be, say, 500 trees. Random forest says to the decision tree: if you fit one tree, there is a chance of the result being biased, of overfitting, or of the model not being stable; instead, I will fit 500 trees. How the prediction is made is very important to understand: the prediction of a random forest is the average of all the individual predictions. For example, if we are trying to predict the salary for age 50, the random forest takes the prediction from tree 1, plus the prediction from tree 2, and so on up to the prediction from tree 500, and averages them all.

What are we trying to achieve here? Suppose one decision tree is overfitting, not performing well, or is biased. Since you are taking feedback from 500 different trees, that overfitting or model instability problem may not be there. That is how random forest differs from a decision tree. Remember also that the individual trees do not all use all the data. For example, suppose your data has one thousand rows and 10 columns. The trees will not necessarily use all the records: tree one may use 100 records and three randomly selected columns, tree two may use two hundred records and three randomly selected columns, and so on. That is the advantage of random forest: each tree may learn a different kind of pattern, and when you take the aggregated result, you get all the flavors.

The kind of learning I just explained is known as ensemble learning. At Unfold Data Science you will find a big playlist explaining all the algorithms of ensemble learning in detail; I will paste the link in the description, and you must check it if you have any confusion about how ensemble learning works. But there is more to ensemble learning. What happened just now in random forest is known as the parallel way of learning. Why parallel? Because tree one, tree two, and tree three are independent of each other: when you call a random forest model, tree 1 can start building by taking a subsample of the data, and tree 2 can start building by taking its own subsample; they do not depend on each other, so all of this can happen in parallel. Hence we call it parallel learning.
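A minimal sketch of this in Python, again assuming scikit-learn and the same toy numbers:

```python
# Minimal random forest sketch: many trees, each trained on a random
# subset of rows (and columns), with predictions averaged at the end.
from sklearn.ensemble import RandomForestRegressor

X = [[22], [41], [58]]
y = [23000, 80000, 150000]

# 500 trees, as in the example; random_state just makes the run repeatable
forest = RandomForestRegressor(n_estimators=500, random_state=42)
forest.fit(X, y)

# The forest's answer is the average of all 500 individual tree predictions
print("prediction for age 50:", forest.predict([[50]])[0])
```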

Now the question is: is there another way of learning in ensembles? Yes, and that brings us to our next algorithm, known as AdaBoost, which stands for adaptive boosting. Let me write the data one more time; I may write some slightly different numbers, but that is not important, understanding the concept is. Say age 22 with 22,000, age 42 with 50,000, and age 58 with 150,000. This is your input data. Boosting is another technique in the ensemble category. In boosting, specifically AdaBoost, a weight is assigned to every observation. Suppose this is your original training data, salary being your target column. What will the initial weights be? The same weight for every record: there are three records, so 1/3, 1/3, and 1/3. All the rows are equally important.

Try to understand this concept: in AdaBoost, in the first iteration, all the rows are equally important. But the way AdaBoost works is right there in the name: adaptive. It adapts to the mistakes of the previous model. Why am I talking about a previous model and a next model? One thing you must always remember: AdaBoost is a sequential learning process. Remember how I just told you random forest is a parallel learning process, where tree one and tree two are independent of each other, each taking a subsample and building on its own? In AdaBoost, and in other boosting techniques, learning is sequential: multiple models are fitted to the data, model 1, model 2, model 3, model 4, and so on, however many there are, but not in parallel; they happen in sequence.

Now, the important thing to understand is how this sequence is generated. Model 1 you can think of as a base model. And remember, in AdaBoost the decision trees look like stumps: a stump is a tree whose depth does not go beyond one level; that is what stump means in the language of machine learning. So multiple stumps will be created, and your model 1 is the first stump.

Model 1 comes and makes some predictions about the salary, so we get another column called salary_prediction, and these predictions come from model 1, the first model. Obviously there will be some mistakes: 22,000 may be predicted as 21,900; 50,000 may be predicted as, say, 52,000; and 150,000 as, say, 200,000, based on this first model, the first stump. So there are differences between actual and predicted, and from those come the residuals. Residuals means errors. What will the errors be? Actual minus predicted: 22,000 minus 21,900 is 100; for the second row, 50,000 minus 52,000 is minus 2,000; for the third, 150,000 minus 200,000 is minus 50,000. Those are the errors from the first model.

Now, about those initial weights: when the next model, M2, is fitted, the weights are changed, and more preference is given to the observations where the residuals are larger. I am repeating one more time: M1 makes its predictions and the residuals, the errors, come out; when M2 is trained, the weights are no longer the same for all three records; rather, the weight is increased where the error is larger and decreased where the error is smaller. And so on: M2 predicts, residuals are computed, weights are adjusted; M3 predicts, residuals are calculated, weights are adjusted again. Finally, what you get as the final model is a combination: the base model plus M1 plus M2 plus M3 and so on. Remember, this is not an exact mathematical equation, just an indicative one; if you want the mathematics behind it, please click the link in the description. Also, all these models will not have an equal say in the final output; their say differs. In random forest you saw that all the models have an equal say in the final output, since we simply divide by 500, but here the models get unequal weights in the final combination.

What are the pros and cons of this model? It may give you a better result than many other models because it adapts to its mistakes. But if you have a larger dataset it may need more resources to train, and it is also something of a black box model, meaning you do not have much visibility into what is going on inside apart from some hyperparameters.
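A minimal sketch with scikit-learn (an assumption for illustration; the library handles the reweighting internally):

```python
# Minimal AdaBoost sketch: models fitted in sequence, with row weights
# increased wherever the previous rounds made larger errors.
from sklearn.ensemble import AdaBoostRegressor

X = [[22], [42], [58]]
y = [22000, 50000, 150000]

# Each boosting round refits on reweighted data; the rounds are then
# combined with unequal say in the final prediction.
booster = AdaBoostRegressor(n_estimators=50, random_state=42)
booster.fit(X, y)

print("prediction for age 50:", booster.predict([[50]])[0])
```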

Let's move on to the last algorithm in the regression category, known as gradient boost. Remember, none of the algorithms I am explaining here is rarely used; all of them are used heavily. I will take a simple dataset: age 21 with a salary of, say, 20k; age 40 with, say, 42k; age 58 with, say, 60k. This is your input data, and you want to run gradient boost on it. Understand that this is again sequential learning, not parallel learning. First there will be a base prediction for all these records. What is the base prediction? It is a kind of dumb model: for every record it simply predicts the average of all the targets, (20k + 42k + 60k) / 3, which is roughly 41k. So the base prediction is put everywhere: 41k, 41k, 41k.

Once the base prediction comes, a residual is computed: the difference between the actual and predicted values, whatever those numbers are. Now comes the interesting part, how gradient boost differs from AdaBoost and other algorithms: gradient boost fits a model on these residuals and tries to minimize them. The first model is the base model; then comes the next model, which you can call residual model 1, then residual model 2, and so on. Residuals are computed, and based on whatever residual comes, the base prediction is updated. For example, the residual for the first record is 20k minus 41k, that is minus 21k. Here age acts as the independent column and this residual acts as the target column, and if the residual model predicts this minus 21k as, say, minus 10k, the base prediction gets updated by that amount, and then updated again in the next round. It is a more involved model; if you want more detail, there are links in the description, please click on them and it will become very clear. So the final model is the base model plus residual model 1 plus residual model 2 and so on, and there are parameters that assign a weight to each of these models; as I said, these models do not get an equal vote in the final output.

So this is gradient boost, one of the famous algorithms for winning Kaggle competitions and much else. There is another variant of gradient boost known as XGB, extreme gradient boosting; please go ahead and read about that algorithm. I am not covering it here because there is only a slight difference between gradient boost and XGB, and you can read about that as well.
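A minimal gradient boosting sketch in Python (scikit-learn and the toy numbers are assumptions for illustration):

```python
# Minimal gradient boosting sketch: start from a base prediction (the mean
# of the target), then fit each new tree on the residuals of the model so far.
from sklearn.ensemble import GradientBoostingRegressor

X = [[21], [40], [58]]
y = [20000, 42000, 60000]

# learning_rate scales how strongly each residual model corrects the prediction
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X, y)

print("prediction for age 50:", gb.predict([[50]])[0])
```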

Fine, let's move on to the second category of algorithms: classification. The first algorithm I am going to cover here is logistic regression. This is very important, so please pay attention and try to understand how logistic regression works for a given scenario; it is a mathematical model, hence worth understanding properly. Suppose this is employee data: a 21-year-old makes 22k, and the last column records whether the employee leaves the organization (1) or does not leave (0). A 40-year-old makes, say, 42k and does not leave, so 0; a 58-year-old makes, say, 60k and leaves, so 1. This is a classification problem: we are trying to predict whether an employee will leave the organization or not, and the last column is your target column. It is called classification because the objective of the model is this: tomorrow I give you the age of an employee, for example 31, and the salary, for example 34k, and I ask the model whether this person will leave the organization or not.

To see how logistic regression handles this problem, we have to understand some mathematical concepts. The target column here is only 1 or 0, which means that y, our target, can be either 0 or 1 and nothing else; this is a very important point. But age and salary can take any real number: x can be any value between minus infinity and plus infinity. So we somehow have to create a relationship that lets us predict y given x. The problem is that on one side we have x, the independent features, with a range of minus infinity to plus infinity, while on the other side the values can only be 0 or 1 (just the two values, not the interval from 0 to 1). So we do not predict y directly; instead we predict something else: probabilities, the probability of an observation falling into class 1. If we predict probabilities, the range becomes 0 to 1, since as you know a probability lies between 0 and 1. But that range is still not the minus infinity to plus infinity we are looking for, so we do one more transformation and convert the probability into odds, p / (1 - p). What is the range of odds? 0 to infinity. Still not minus infinity to plus infinity, so we take the log of the odds, and then the range becomes minus infinity to plus infinity.

So how does the equation look? On the left hand side you have the log odds, log(p / (1 - p)), and on the right hand side beta0 + beta1 * x1 + ... and so on. The equation in front of you is called the base equation of logistic regression. One important concept to understand here: this is the logit function, and the inverse of the logit function is the sigmoid function. Suppose you apply the sigmoid to both sides. If you do not know the sigmoid function, it is f(x) = 1 / (1 + e^(-x)); on an XY plane it is an S-shaped curve that passes through 0.5 and always stays between 0 and 1. So your logistic regression equation can be rewritten in sigmoid form: p = 1 / (1 + e^(-(beta0 + beta1 * x1 + ...))). Call the logit form equation 1 and this sigmoid form equation 2; both equations say the same thing, the difference being only the form, and the inverse of the logit is the sigmoid. For our example it can be written as p = 1 / (1 + e^(-(beta0 + beta1 * age + beta2 * salary))). This is nothing but your logistic regression equation, and since it is a sigmoid, the output is always between 0 and 1, which means you get a probability; based on this probability you can then say whether the employee leaves or does not leave.

Logistic regression is again a very important and not easy to understand concept. As you can see, we are modeling a categorical variable against real numbers, hence we need to do certain transformations; those are the transformations we need, and I just explained how they relate to probability. Pros and cons: it is a mathematical model that is not very difficult to understand; the con is that it assumes a lot of things about the data which may or may not hold, hence it may not give a great result all the time. But it is a very famous and very important algorithm to understand.
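A minimal sketch in Python (scikit-learn, the unscaled toy features, and the age-31/34k query are assumptions for illustration):

```python
# Minimal logistic regression sketch on the toy attrition data:
# the model outputs a probability via the sigmoid, thresholded at 0.5.
from sklearn.linear_model import LogisticRegression

X = [[21, 22000], [40, 42000], [58, 60000]]  # age, salary (unscaled, for simplicity)
y = [1, 0, 1]                                # 1 = leaves, 0 = does not leave

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# predict_proba gives [P(stay), P(leave)]; predict applies the 0.5 threshold
print("P(leave) for age 31, salary 34k:", clf.predict_proba([[31, 34000]])[0][1])
print("prediction:", clf.predict([[31, 34000]])[0])
```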

The next algorithm in the classification category is a simple one I want to cover here: K-nearest neighbors, or KNN. It is a pretty simple algorithm. Suppose you want to build a KNN model on the same data; since I have the data here, I will explain with it. The data is plotted on an XY plane; it is three-dimensional data, so you could add one more axis for salary, but two axes are enough to explain the idea. Say age on one axis and salary on the other. Of the three employees, the first, 21 with 22k, falls here; the second, 40, falls here; and 58 falls somewhere here. What K-nearest neighbors does is identify the neighbors of the individual observations: this is observation one, this is observation two, this is observation three. Observation one has no close neighbors, but observation 2 is the neighbor of 3, and 3 is the neighbor of 2.

Tomorrow a prediction request comes, say for age 50 again, and since salary is also a feature in this case, say salary 61k. KNN will see where this new point with age 50 and salary 61k falls and who its nearest neighbors are; suppose the new point lands here, so this is its first neighbor and this is its second neighbor. KNN then simply takes the mode of the neighbors' labels: these two neighbors vote with their target values, 0 and 1, and whichever value is in the majority wins. On this tiny dataset there is no clear mode, but on larger data there will be: suppose there are 30 neighboring records, of which 20 are 1 and 10 are 0; then the prediction for the new point is the majority value, 1. Whatever the mode is, that is the KNN prediction. As I told you, KNN is a pretty simple algorithm: it just plots your data and finds the nearest neighbors, and when a new observation comes, you specify how many neighbors to use for that record and it predicts based on them. KNN is simple to understand, nothing complex, so I covered it quickly on the slide itself.
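A minimal sketch in Python (scikit-learn and n_neighbors=2 are assumptions for illustration; note that on real data you would scale the features, since raw salary dominates the distance):

```python
# Minimal KNN sketch: store the points, find the k nearest neighbors of a
# new point, and predict the mode of their labels.
from sklearn.neighbors import KNeighborsClassifier

X = [[21, 22000], [40, 42000], [58, 60000]]  # age, salary
y = [1, 0, 1]                                # 1 = leaves, 0 = does not leave

knn = KNeighborsClassifier(n_neighbors=2)    # you choose how many neighbors vote
knn.fit(X, y)

# New employee aged 50 earning 61k: the two nearest points vote on the label
# (ties are possible on a dataset this tiny; larger data gives a clear mode)
print("prediction for (50, 61000):", knn.predict([[50, 61000]])[0])
```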

Now let's try to understand another classification technique, known as support vector machines, or SVMs. An SVM plots your data on whatever axes you have, suppose age on one axis and salary on the other, and this time I will take a few more data points: these are some data points, and these are some more. What the SVM does is try to create something known as a decision boundary. How is this different from linear regression? In linear regression a line is fitted through a pure mathematical equation; here there is the concept of something known as a hyperplane. For example, if I draw a line between these two groups, all the black points you can think of as leaves, target 1, and all the others as does not leave, target 0. If your data is like this, the SVM plots what is called, in the language of SVM, a decision boundary. In this case your data looks pretty simple and well separated, hence the decision boundary can be as simple as a line. But in most real world scenarios the decision boundary cannot be this simple: there will be some black dots over here, some black circles over there, and some blue crosses on this side. In that case a straight-line decision boundary does not do justice; the boundary needs to change, and that is where two very important SVM concepts come in: hyperplanes and kernels. Explore hyperplanes and kernels if you want to go deeper into SVM. When your data becomes complex, a simple decision boundary cannot predict well, so you need a complex decision boundary, and that is what hyperplanes and kernels provide.

Just to give you an idea of how an SVM works: it creates a decision boundary, and tomorrow when a new case comes, for example a person with age 50 and salary 60k, and you ask whether the person will leave or not, the SVM model checks on which side of the decision boundary the person falls. If the person falls on this side, it says does not leave; on the other side, it says leaves. So for SVM, remember the concepts of decision boundaries, hyperplanes, kernels, and kernel tricks. We have now covered five algorithms from the regression scenario and three from classification.
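A minimal sketch in Python (scikit-learn and the toy points are assumptions for illustration):

```python
# Minimal SVM sketch: fit a decision boundary, then classify a new point
# by which side of the boundary it falls on.
from sklearn.svm import SVC

X = [[21, 22000], [40, 42000], [58, 60000]]  # age, salary
y = [1, 0, 1]

# kernel="linear" draws a straight-line boundary; swap in kernel="rbf"
# when the data needs the more complex boundaries mentioned above
svm = SVC(kernel="linear")
svm.fit(X, y)

print("prediction for (50, 60000):", svm.predict([[50, 60000]])[0])
```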

Now let's go ahead and look at some unsupervised learning problems. What is the meaning of unsupervised learning? Till now we always had a target column, but in unsupervised learning we may not have one. Suppose for the same employee data we have age and salary, and somebody comes and asks: can you tell me whether different buckets of employees exist in my organization? Different buckets means, for example, some people with less age and more salary, and some people with more age and less salary. How do you solve that problem? By using something known as clustering, or segmentation.

So suppose the task at hand is this: here there are only three records, but in a real world scenario there can be many more, and what I am interested in knowing is whether there are natural clusters in my organization. This is my organization's data, with age on one axis and salary on the other, and I am plotting extra data points just for demonstration. There is nothing to predict, but the employer is interested in knowing whether there are buckets, that is, whether some employees are close to each other in terms of their characteristics. For example, these employees are close together, call them bucket one; those employees are close together, call them bucket two, or segment two.

How is this implemented? One technique for implementing this bucketing is k-means clustering; there are other techniques for segmentation as well, but k-means is one of them. In this technique the distance between the various employees is computed. Suppose this is employee one and this is employee two, and I ask how similar employee one is to employee two. There are different similarity metrics you can compute, for example Euclidean distance, Manhattan distance, cosine similarity, and so on; I have detailed videos on these as well and will link them in the description. As a simple example of how similar or different two employees are: take (21 - 40) squared plus (20k - 42k) squared, that is, square the difference along every dimension, sum the squares, and take the square root; this is called the Euclidean distance between E1 and E2, whatever number it comes to. Suppose the Euclidean distance between E1 and E2 is small and between E1 and E3 is large; then you say E1 and E2 are closer to each other. In the same way you keep finding the employees that are closest to each other and call them one bucket, and another close group you call another bucket. I have explained this in simple terms, but there is a very important concept in k-means known as the centroid; please go ahead and watch the detailed Unfold Data Science video on k-means clustering, where you will understand how the centroid is defined and how the algorithm works at a mathematical level. I will link that video; please make sure you watch it.
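A minimal sketch in Python (scikit-learn, plus a couple of made-up extra employees so the buckets are visible, are assumptions for illustration; in practice you would scale the features first, since Euclidean distance is dominated by the larger-valued salary column):

```python
# Minimal k-means sketch: group employees into 2 buckets by age and salary.
from sklearn.cluster import KMeans

X = [
    [21, 22000], [25, 25000],   # younger, lower salary
    [40, 42000],
    [55, 58000], [58, 60000],   # older, higher salary
]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # bucket assignment for each employee

print("bucket of each employee:", labels)
print("bucket centroids:", kmeans.cluster_centers_)
```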

So that is k-means clustering. Now, last but not least: you might have seen on Amazon and Flipkart that different products are recommended to you; for example, if you buy a laptop, it will tell you, hey, go ahead and buy this laptop bag as well. That is nothing but a recommendation. On Netflix, if you watch one action movie, say Mission Impossible, it may go and recommend the Jack Ryan series to you. That is a recommendation system running in the background. How does this system work? One simple yet powerful technique for recommender systems is known as collaborative filtering.

What collaborative filtering does is take users on one side and items on the other; try to understand this simple concept, it is pretty easy. The users can be Aman, John, and Doe. Among the items we can have, say, Mission Impossible, Jack Ryan, a James Bond movie, Spiderman, and a comedy movie, say Home Alone. Now, which movies has Aman watched? Mission Impossible he has watched, Jack Ryan he has watched, but the James Bond movie he has not watched, so I will put a zero, and the Spiderman movie he has not watched either. There is another guy, John, who has watched Mission Impossible, Jack Ryan, the James Bond movie, and the Spiderman movie as well. And there is another guy, Doe, who has watched none of these but has watched Home Alone, the comedy movie. Which users are similar to which users is computed using one of the user similarity metrics; I told you about cosine similarity, and it can be different kinds of distance metrics too. As common sense also tells you, Aman watches action movies, and John watches action movies as well, Mission Impossible and Jack Ryan; but Aman has not watched the James Bond movie and the Spiderman movie. Since Aman's and John's tastes are similar, go ahead and recommend to Aman what John has watched but Aman has not. So what recommendation goes to Aman? The James Bond movie and the Spiderman movie. Now imagine this as a large matrix of many users and many items: the system finds which users' tastes are similar to each other, and a user who has not watched a movie gets recommended movies or series based on the watching history of similar users. This is a pretty simple but powerful technique known as collaborative filtering.
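A minimal user-based sketch in Python with NumPy (the watch matrix mirrors the example above; the cosine helper and the 1/0 encoding are assumptions for illustration):

```python
# Minimal user-based collaborative filtering sketch with cosine similarity.
# Rows are users (Aman, John, Doe); columns are titles; 1 = watched.
import numpy as np

titles = ["Mission Impossible", "Jack Ryan", "James Bond", "Spiderman", "Home Alone"]
watched = np.array([
    [1, 1, 0, 0, 0],   # Aman
    [1, 1, 1, 1, 0],   # John
    [0, 0, 0, 0, 1],   # Doe
])

def cosine(u, v):
    # cosine similarity between two users' watch vectors
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

aman = watched[0]
others = [1, 2]                                 # John, Doe
sims = [cosine(aman, watched[i]) for i in others]
best = watched[others[int(np.argmax(sims))]]    # the most similar user

# Recommend what the most similar user watched but Aman has not
recs = [t for t, a, b in zip(titles, aman, best) if b and not a]
print("recommend to Aman:", recs)   # James Bond and Spiderman
```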

Now let's revise, guys, everything we discussed. It was a long discussion, but very fruitful for revising some fundamental concepts: linear regression, decision tree, random forest, AdaBoost, and gradient boost we discussed for regression; for classification I explained logistic regression, how SVM works, and how KNN works; and I explained two unsupervised techniques, k-means and collaborative filtering. I did not go into too much detail, because it is not possible to cover all the details of 10 algorithms in a short time, but treat this as a refresher, and please go ahead and click the links for whichever algorithm you are more interested in learning; all the videos are there on Unfold Data Science.

I request you guys, please press the like button if you like this video, and please press the subscribe button and the bell icon if you want me to create more videos like this. See you all in the next video. Wherever you are, stay safe and take care.

Hello,
My name is Aman and I am a Data Scientist.

All amazing data science courses at the most affordable price here:

Please find link for all algorithms in detail:
Linear regression :
Logistic Regression :
Ensemble models :
SVM :
Kmeans :
Recommendation engine :


About Unfold Data science: This channel helps people understand the basics of data science through simple examples, in an easy way. Anybody without prior knowledge of computer programming, statistics, machine learning, or artificial intelligence can get a high-level understanding of data science through this channel. The videos uploaded are not very technical in nature and hence can be easily grasped by viewers from different backgrounds as well.

Book recommendation for Data Science:

Category 1 – Must Read For Every Data Scientist:

The Elements of Statistical Learning by Trevor Hastie –

Python Data Science Handbook –

Business Statistics By Ken Black –

Hands-On Machine Learning with Scikit Learn, Keras, and TensorFlow by Aurelien Geron –

Category 2 – Overall Data Science:

The Art of Data Science By Roger D. Peng –

Predictive Analytics by Eric Siegel –

Data Science for Business By Foster Provost –

Category 3 – Statistics and Mathematics:

Naked Statistics By Charles Wheelan –

Practical Statistics for Data Scientist By Peter Bruce –

Category 4 – Machine Learning:

Introduction to machine learning by Andreas C Muller –

The Hundred Page Machine Learning Book by Andriy Burkov –

Category 5 – Programming:

The Pragmatic Programmer by David Thomas –

Clean Code by Robert C. Martin –


Watch Introduction to Data Science full playlist here :

Watch python for data science playlist here:

Watch statistics and mathematics playlist here :

Watch End to End Implementation of a simple machine learning model in Python here:

Learn Ensemble Model, Bagging and Boosting here:

Build Career in Data Science Playlist:

Artificial Neural Network and Deep Learning Playlist:

Natural language Processing playlist:

Understanding and building recommendation system:

Access all my codes here:

Have a different question for me? Ask me here :

My Music:
00:00 if you have an interview coming up and
00:02 you want to revise 10 most important
00:04 machine learning algorithms real quick
00:07 you will not find a better video than
00:09 this let’s go ahead and do the revision
00:11 of 10 most frequent used ml algorithms
00:14 these are the 10 algorithms I am going
00:17 to explain you how they work and what
00:19 are their pros and cons okay and as you
00:23 can see first five algorithms is in one
00:25 color next three is in a different color
00:27 and last two is in a different color
00:29 there is a reason for that guys I will
00:31 tell you in a moment but before that
00:33 let’s try to answer two basic questions
00:37 okay let’s try to answer what is machine
00:40 learning and what are algorithms okay so
00:44 I’ll start with a non-bookish definition
00:47 and I will give you one simple example
00:49 suppose you want to travel from
00:52 Bangalore to Hyderabad okay where you
00:55 want to go you want to go from Bangalore
00:57 to Hyderabad
00:58 for this you can either take a train or
01:03 you can either take a flight or you can
01:07 take a bus as well or maybe you can
01:10 drive your own car as well okay so two
01:13 things we have to understand here guys
01:15 what is the task okay and what is the
01:19 approach
01:21 fine
01:22 so the task in hand is we have to go
01:25 from Bangalore to Hyderabad okay and the
01:28 approach is all these three options that
01:31 I told you just now
01:32 now related to the world of machine
01:34 learning in machine learning the task
01:37 can be different kinds of tasks okay for
01:41 example it can be a regression task
01:44 okay or it can be a classification task
01:48 okay
01:49 or it can be a unsupervised learning
01:53 problem I will just write unsupervised
01:55 okay so in approach section we can have
02:00 different different approaches based on
02:02 if we are solving a regression problem
02:05 or we are solving a classification or we
02:07 are solving a particular case of
02:09 unsupervised learning okay in regression
02:12 also we can take many approaches for
02:16 example in regression there is not only
02:17 one approach in regression I can take
02:20 approach one approach two approach 3
02:22 approach 4 approach five in
02:24 classification I can take this approach
02:25 this approach this approach in
02:27 unsupervised also I can take multiple
02:29 approaches so that is why this color
02:32 coding is there
02:33 the first five algorithms that you see
02:36 here
02:37 will solve I will explain you for
02:39 regression use case Okay so there we
02:42 will take a regression use case and try
02:44 to understand how to solve that using
02:46 these five algorithms okay the next
02:49 three that you see I am going to explain
02:51 you with a classification use case so
02:54 these approaches are for classification
02:56 problem okay
02:58 and last two I am going to explain you
03:00 for a unsupervised learning problem how
03:03 that will be this these algorithms will
03:04 be used to solve unsupervised learning
03:06 problem okay
03:08 so let’s go ahead guys and try to
03:10 understand with a simple input data I
03:13 have taken a sample input data here and
03:15 let’s without any delay start on the
03:18 first algorithm known as linear
03:19 regression so machine learning is all
03:22 about learning pattern from the data
03:24 using algorithms okay so if we are using
03:28 a algorithm known as linear regression
03:30 then what will happen let’s try to
03:33 understand that so first algorithm of
03:36 our list linear regression okay now
03:39 suppose this is the employee data of an
03:40 organization you have a age column you
03:42 have a salary column fine so 22 years
03:45 person earns 23 000 and so on and so
03:48 forth suppose we using the linear
03:50 regression approach to solve this
03:52 regression problem now as I told you
03:54 first five problems will be regression
03:56 problems first five algorithms you will
03:58 understand using regression problem okay
03:59 come here this is your data so what
04:02 linear regression will do is
04:04 it will just take this data and it will
04:08 see how the data is plotted on a XY
04:11 plane like this for example on one axis
04:14 we can take salary okay on y axis
04:18 and on x axis we can take Edge okay and
04:22 I am just roughly pointing these points
04:24 okay first point 22 and 23 000 maybe it
04:28 can come somewhere here on x axis if you
04:31 put h on Y axis salary I am just putting
04:33 here second data point can come
04:36 somewhere here let’s say 41 and 80 000
04:38 data points and third data point 58 and
04:41 150k this data point can come maybe
04:43 somewhere here I can say okay
04:47 so what linear regression will do is it
04:50 will try to plot a line okay ideally
04:53 what the assumption is all these points
04:56 should fall on same line
04:58 a line like this can be plotted or a
05:00 line like this can be plotted but the
05:03 Assumption here is
05:04 ideally in an Ideal World all these
05:07 points will fall in the same line but it
05:09 will never happen in the real world so
05:11 what logistic linear regression will do
05:14 is it will try to fit something known as
05:16 a best fit line okay so this is your
05:18 best fit line let’s assume that how this
05:21 best fit line is computed it will try to
05:24 minimize the distance from all these
05:27 points together so distance from this
05:29 point is this distance from this point
05:31 is this parallel to Y axis distance from
05:34 this point is this okay so you can call
05:36 this even you can call this E2 you can
05:38 call this E3 okay so what linear
05:41 regression will do is it will try to
05:44 minimize even Square
05:46 plus E2 square plus E3 Square
05:49 for whichever line it finds the minimum
05:52 even Square E2 Square E3 Square it will
05:55 call that line as the model
05:57 okay it will call that line as the model
06:00 now as you know from your normal
06:02 understanding of mathematics
06:04 this straight line will have a equation
06:07 in the form of mostly simplest we can
06:09 write Y is equal to MX plus C right in
06:13 our case I can say salary is equal to M
06:17 times h m times of H this is
06:20 multiplication plus c c can be an
06:23 intercept let’s give some number here
06:25 some random number I will give let’s say
06:27 2000 okay
06:29 so imagine this line which is the model
06:32 for linear regression has this formula
06:35 okay now the next question comes
06:38 tomorrow when the pattern has been
06:40 learned and a new age comes let’s say
06:43 age is 50.
06:44 so what will be the salary for that
06:46 person so very simple the model will
06:49 come here and put the numbers here for
06:52 example if for M we can put any number
06:54 let’s say 0.2
06:56 then age will be 50 and then salary will
07:00 be intercept will be 2000 whatever this
07:03 calculation comes that will be the
07:05 prediction of the salary for this 50
07:07 okay very simple very simple
07:10 mathematical model the assumption is
07:13 there is a linear relation between
07:15 independent variable and Target variable
07:17 okay that’s the Assumption so what it
07:20 will do it will try to plot a line what
07:22 it will call as a best fit line wherever
07:24 it finds this value as minimum once the
07:27 best fit line comes then how the
07:29 prediction happens like this okay
07:32 obviously there will be pros and cons of
07:34 all the algorithms all the models so
07:36 what is the pros and cons of linear
07:38 regression the the pluses or Pros for
07:41 this model will be it’s a simple to
07:42 understand model it’s a mathematical
07:44 model you can explain to someone but the
07:46 cons will be
07:48 um it’s not necessary that your data
07:51 will always be this simple that can be
07:53 fit in a line right or close to a line
07:55 so it’s a simple model hence lot of real
07:59 world problems it may be difficult to
08:00 solve with simple linear regression
08:02 there can be a varieties in linear
08:04 regression that
08:05 um I have created videos you can watch
08:07 through those videos but simply linear
08:10 regression works like this okay this is
08:12 one first approach
08:14 first approach means first algorithm now
08:16 let’s go ahead and try to see how
08:18 decision tree will approach the same
08:21 problem okay how decision tree will
08:23 approach this same problem
08:25 so if you give this same data okay if
08:28 you give the same data to decision tree
08:30 and you ask hey learn pattern from this
08:33 data what decision tree will do is
08:36 it will just try to break the data how
08:38 it will break the data is it will create
08:40 a rule like this okay so I can write a
08:43 rule here for example I can say is less
08:47 than equals to 30 this is a rule okay so
08:51 some records will satisfy this rule okay
08:54 some records will satisfy and some
08:56 records will not satisfy this way data
08:59 will break okay if you come here is less
09:02 than 30 how many records only one record
09:04 is more than 30 two records so how many
09:08 records will come this side only one
09:10 record will come okay so let’s say that
09:13 record is
09:15 I should not write the wrong Numbers 22
09:17 23k 4180k
09:20 so I will write here 22 and 23 K and
09:25 here I will write 41 and 80k okay and
09:28 there is one more record let me take the
09:30 numbers 58 and 150k
09:32 58 and
09:34 150k understand this carefully guys
09:36 because for next next algorithms this is
09:38 the base okay
09:40 so decision tree will split your data
09:42 like this so you had total how many
09:44 records in the beginning three records
09:46 here how many records you are having one
09:48 record here how many records you are
09:49 having two records okay so this is first
09:51 level of split now definitely can split
09:54 it one more time okay
09:56 so tree can make here there are limited
09:59 number of Records but imagine if there
10:00 are more records there can be one more
10:02 split here saying you know another
10:05 filter is is maybe less than 40 or
10:08 something like this okay but I will not
10:10 take that now that will make the tree
10:12 complex okay so this is your model
10:15 breaking your data based on some
10:18 conditions is nothing but your model so
10:21 somebody asks you what is a model in
10:22 decision tree this is your model now the
10:25 important question is suppose tomorrow
10:27 somebody comes and asks for a person
10:30 with age 50 what is your prediction for
10:33 a person with age 50 what is your
10:35 prediction very very important concept
10:37 to understand guys decision tree will
10:39 come and check what is this for age 50
10:43 okay so age 50 will come in which
10:45 category will come in this line okay in
10:49 this line how many records are there two
10:51 records so decision tree will go ahead
10:53 and take the average of these two
10:55 salaries so for age 50 your prediction
10:58 will be what will be the prediction guys
11:00 for age 50 prediction will be 80k plus
11:04 150k divided by 2. okay this is how
11:08 decision tree will be making the
11:09 prediction
11:10 suppose you ask through this entry hey
11:13 what will be the salary of a person with
11:15 age 21 so it will not go to right hand
11:17 side it will go to left hand side
11:19 because this is the tree branch in which
11:21 it should go it will directly say 23k in
11:23 this case because there is only one
11:24 record Suppose there are two records it
11:26 will take the average okay so you see
11:29 how these two approaches are different
11:31 for solving same regression problem here
11:33 a mathematical line will be fit and here
11:36 a decision tree you know data will be
11:39 broken into multiple pieces and
11:40 prediction will be made okay remember
11:43 guys decision tree is based for many
11:45 other Advanced algorithms and our third
11:48 algorithm in the list is something non
11:50 as a random Forest okay a random Forest
11:54 what random Forest will do is it will
11:56 say decision tree okay you have done a
11:58 good job but
12:00 uh there is a chances of overfitting of
12:03 the data so we did not discuss pros and
12:05 cons of this process it’s a simple model
12:07 you know you don’t need to do a lot of
12:09 mathematics Etc and cons is there is a
12:12 chances of overfitting because you know
12:14 if there is a little change in the data
12:16 your model may change totally that’s a
12:18 risk here in decision tree so
12:20 overfitting
So Random Forest will come and say: hey, you are taking the right approach, but there is a chance of overfitting, so why don't you fit multiple trees? What Random Forest will do is create multiple trees, each like the single decision tree we just saw: this is, for example, tree one, this is tree two, and similarly there can be n number of trees. We will call them T1, T2, and so on, and there can be, say, 500 trees. Random Forest says to the decision tree: if you fit one tree, there is a chance of the result being biased, or overfit, or the model not being stable; what I will do instead is fit 500 trees. And how the prediction is made is very important to understand here, guys: the prediction of Random Forest is the average of all the individual predictions. For example, if we are trying to predict the salary for age 50, Random Forest will take the prediction from tree one, plus the prediction from tree two, plus ... plus the prediction from tree 500, and take the average of all of them.
What are we trying to achieve here? Suppose one decision tree is overfitting, not performing well, or biased. Since you are taking feedback from 500 different trees, that overfitting or model-instability problem may not be there. This is how Random Forest differs from a decision tree. Remember, these individual trees will not all use all of the data. For example, suppose your data has one thousand rows and 10 columns, just as an example: the trees will not necessarily use all the records. Tree one may use 100 records and three randomly selected columns; tree two may use 200 records and three other randomly selected columns. That is the advantage of Random Forest: the trees may each learn a different kind of pattern, and when you take the aggregated result, you get all the flavors. This kind of learning that I just explained is known as ensemble learning. Remember, guys, at Unfold Data Science you will find a big playlist explaining all the ensemble learning algorithms in detail; I will paste the link in the description. Do check it if you have any confusion about how ensemble learning works.
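As a rough sketch of the same idea in code (again scikit-learn, as my own illustration): many trees, each fit on a bootstrap sample of the rows, with the final prediction being the average of all the trees' outputs.

```python
# Sketch only: a forest of 500 trees whose predictions are averaged.
from sklearn.ensemble import RandomForestRegressor

X = [[22], [41], [58]]
y = [23_000, 80_000, 150_000]

# Each tree sees a bootstrap sample of rows; per-split column
# subsampling is controlled by the max_features parameter.
forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
print(forest.predict([[50]]))   # average of the 500 trees' predictions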
But there is more to ensemble learning. What happened just now in Random Forest is known as a parallel way of learning. Why parallel, guys? Because tree one, tree two, and tree three are independent of each other: when you call a Random Forest model, tree one can start building by taking a subsample of the data, tree two can start building by taking its own subsample, and they are not dependent on each other. All of this can happen in parallel, hence we call it parallel learning. Now the question is: is there another way of learning in ensembles? Yes, and there comes our next algorithm, known as AdaBoost, which stands for adaptive boosting.
So what will AdaBoost do? Let me write the data here one more time, and I may write some different numbers; that's not important, just understanding the concept is important. I will keep the first record as 22 with 22,000; for 42 I will write 50,000; and for 58, let's say, 150,000, just as an example. This is your input data. Boosting is another technique in the ensemble category. In boosting, and especially in AdaBoost, a weight is assigned to every observation. Suppose this is your original training data, salary being your target column. What will the initial weights be? The same weight for every record: there are three records, so each gets 1/3, meaning all the rows are equally important. Try to understand the concept, guys: in AdaBoost, in the first iteration, all the rows are equally important. But how AdaBoost works is right there in the name: adaptive. It adapts to the mistakes of the previous model. Why am I talking about a previous model and a next model? One thing you always have to remember: AdaBoost is a sequential learning process. Remember how I just told you Random Forest is a parallel learning process: in Random Forest, tree one and tree two are independent of each other; each takes a subsample and builds its tree, with nothing to do with the other. But AdaBoost, like other boosting techniques, is a sequential model. There will be multiple models fitted to the data, and I will tell you in a moment what these models are: model 1, model 2, model 3, model 4, and so on, however many come. It will not happen in parallel; it will happen in sequence. Now, the important thing to understand is how this sequence is generated. Model one you can think of as a base model. And remember, in AdaBoost your decision trees will look like stumps: a stump is a tree whose depth does not go beyond one level; that is what "stump" means in the language of machine learning. Multiple stumps will be created, and suppose your model one is this first stump.
Model one comes and makes some prediction about the salary. So we will have another column, salary_prediction, and this prediction comes from model one, the first model. Obviously there will be some mistakes: 22,000 may be predicted as 21,900; 50,000 may be predicted as, say, 52,000; and 150,000 as, say, 200,000, all based on this first decision tree, the first stump. So there will be some differences between actual and predicted values, and from these come the residuals. Residuals mean errors. What are the errors here? Actual minus predicted: 22,000 minus 21,900 is 100; 50,000 minus 52,000 is minus 2,000; and 150,000 minus 200,000 is minus 50,000. Those are the errors from the first model.
Now, those were the initial weights. When the next model, M2, is fitted, these initial weights will be changed, and more preference will be given to the observations where the residuals are larger. I am repeating one more time, guys: M1 predicts, and then the residuals or errors come; when M2 is trained, the weights are no longer the same for all three records. The weight is increased for the record with more error and decreased for the record with less error. And so on: M2 computes its residuals, then the weights are adjusted again; M3 predicts, residuals are calculated, weights are adjusted; and finally what you get is a combination. Your final model will be a combination of the base model plus M1 plus M2 plus M3 and so on. Remember, this is not an exact mathematical equation, just an indicative one; if you want to understand the mathematics behind it, please click on the link in the description. And all these models will not have an equal say in the final output; their say will differ. In Random Forest, you saw all the models have an equal say in the final output, since we divide by 500; but here the models have an unequal say.
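Here is a small sketch of this with scikit-learn's AdaBoostRegressor, again as my own illustration: a sequence of stumps, each paying more attention to the rows the previous ones got wrong, combined with unequal say.

```python
# Sketch only. (The keyword is base_estimator in older scikit-learn.)
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

X = [[22], [42], [58]]
y = [22_000, 50_000, 150_000]

ada = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=1),  # stumps
    n_estimators=50,
    random_state=0,
).fit(X, y)
print(ada.predict([[50]]))
```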
What are the pros and cons of this model? It may give you a better result than many models, because it adapts to its mistakes; but if you have a larger dataset it may need more resources to train, and it is also somewhat of a black-box model, meaning you don't have much explanation of what is going on inside apart from some hyperparameters.
Let's move ahead to the last algorithm in the regression category, known as Gradient Boost. Remember, guys, all the algorithms I am explaining are heavily used; I have not included anything that is used rarely. I will take simple data: age 21 with salary, say, 20K; age 40 with salary, say, 42K; age 58 with salary, say, 60K. This is your input data, and you want to run Gradient Boost on it.
What will happen? Understand, guys, this is again sequential learning, not parallel learning. There will be a base prediction for all this data. What is the base prediction? It is a kind of dumb model: it assumes, for every record, the average of all three salaries, (20K + 42K + 60K) / 3. Let's say, for simplicity, we take this as 36K. So the base prediction 36K is put against every row: 36K, 36K, 36K. Once the base prediction comes, a residual is computed: the difference between the actual and predicted values, whatever those numbers are.
Now comes the interesting part: how Gradient Boost differs from AdaBoost and other algorithms. What Gradient Boost does is fit a model on the residuals and try to minimize them. That first one is called the base model; then there is a next model you can call residual model 1, then a next one you can call residual model 2, and so on. Residuals are computed, and based on whatever residual comes out, the base prediction is updated. For example, the residual for the first row is 20K minus 36K, which is minus 16K. The age acts as the independent column and this residual acts as the target column. Then, say the model predicts this minus 16 as, let's say, minus 10; the base prediction gets updated by that amount, and the base prediction gets updated again at each step. It is a complicated model; if you want more details, there are links in the description, please click on them and it will become very clear. So the final model is the base model plus residual model 1 plus residual model 2 and so on, and there are parameters that assign a weight to each of these models. As I said, the models will not have an equal vote in the final output; their votes will differ.
So this is Gradient Boost, one of the famous algorithms for winning Kaggle competitions. There is another variant of Gradient Boost known as XGB, Extreme Gradient Boost; please go ahead and read about that algorithm, guys. I am not covering it here because there is only a slight difference between Gradient Boost and XGB, and you can read about that as well.
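A small sketch of the idea with scikit-learn's GradientBoostingRegressor, as my own illustration: start from a constant base prediction (the mean salary), then repeatedly fit shallow trees to the residuals and add their learning-rate-weighted output to the running prediction.

```python
# Sketch only, on the toy data above.
from sklearn.ensemble import GradientBoostingRegressor

X = [[21], [40], [58]]
y = [20_000, 42_000, 60_000]

gb = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0
).fit(X, y)
print(gb.predict([[50]]))
```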
Let's move ahead to the second category of algorithms, known as classification algorithms. The first algorithm I am going to cover is logistic regression. Now, very, very important, guys: please pay attention here and try to understand how logistic regression works for a given scenario. It is a mathematical model, hence it is important for you to understand. Suppose this is employee data: a 21-year-old earns 22K and leaves the organization, which I mark as 1; a 40-year-old earns, say, 42K and does not leave, marked 0; a 58-year-old earns, say, 60K and leaves, marked 1. So this is a classification problem where we are trying to predict whether an employee will leave the organization or not; the last column you see is your target column. This type of problem is called a classification problem because the objective of the model is this: tomorrow I give you the age of an employee, for example 31, and the salary, for example 34K, and I ask the model, hey, will this person leave the organization or not? To see how logistic regression takes on this problem, we have to understand some mathematical concepts. If you look at the target column, it is 1 or 0 only, which means that y, our target, can be either 0 or 1 and nothing else. Understand this, it is a very important concept, guys: your target cannot be anything apart from 0 or 1. But your age and salary can take any real number; X can be any value between minus infinity and plus infinity, while y can only be 0 or 1.
What we have to understand here is that we must somehow create a relation that lets us predict y given X. The problem is that on one side we have the minus infinity to plus infinity range, the range of X, the independent features; on the other side the values can only be 0 or 1. So we do not predict y directly; instead we predict something else: probabilities, the probability of an observation falling in class 1. If we predict probabilities, the range becomes 0 to 1, since, as you know, a probability always lies between 0 and 1. But this range is still not the minus infinity to plus infinity we are looking for, so we apply one more transformation and turn the probability into odds, p / (1 − p), whose range is 0 to infinity. We are still not at minus infinity to plus infinity, so we take the log of the odds, and then the range becomes minus infinity to plus infinity.
How will your equation look? When you take the log of odds, on one side you have log(p / (1 − p)), and on the other side you have the linear part: log(p / (1 − p)) = β0 + β1·x1 + β2·x2 + .... This equation in front of you is called the base equation of logistic regression. Now, one important concept to understand here, guys: this is the logit function, and the inverse of the logit function is the sigmoid function. If you don't know the sigmoid function, it looks like this: f(x) = 1 / (1 + e^(−x)). On the XY plane it is an S-shaped curve: it runs between 0 and 1, crossing 0.5 in the middle, so its output always stays between 0 and 1. So your logistic regression equation can be changed into sigmoid form: if you apply the sigmoid on both sides, on one side you are left with just p, and you get p = 1 / (1 + e^(−(β0 + β1·x1 + ...))). Remember, guys, call these equation 1 and equation 2. Both equations are the same; the difference is that one is the logit equation and the other is the sigmoid equation, and the inverse of the logit is nothing but the sigmoid. Understand this carefully. For our example the equation can be written as p = 1 / (1 + e^(−(β0 + β1·age + β2·salary))). This is nothing but your logistic regression equation, and as I told you, this is a sigmoid function, so its output will always be between 0 and 1, which means you get a probability; based on this probability you can say whether the employee leaves or does not leave. Logistic regression is again a very important and not-easy-to-understand concept. As you can see, we are modeling a categorical variable against real numbers, hence we need to do certain transformations; those are the transformations I just explained, and that is how they relate to probability. Pros and cons: it is a mathematical model that is not very difficult to understand; the con is that it assumes a lot of things about the data, which may or may not be correct, hence it may not give a great result all the time. But it is a very famous and very important algorithm to understand.
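Here is a minimal sketch of exactly this setup with scikit-learn, as my own illustration: the model learns β0, β1 (age), and β2 (salary), and predict_proba returns the sigmoid output, a probability between 0 and 1.

```python
# Sketch only, on the toy employee data above.
from sklearn.linear_model import LogisticRegression

X = [[21, 22_000], [40, 42_000], [58, 60_000]]   # age, salary
y = [1, 0, 1]                                    # leaves the organization?

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba([[31, 34_000]]))   # [P(stays), P(leaves)]
print(clf.predict([[31, 34_000]]))         # thresholded 0/1 answer
```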
The next algorithm in the classification category is a simple one I want to cover here, known as K-nearest neighbors (KNN). It's a pretty simple algorithm. Suppose you want to build KNN on this same data; since I have the data here, I will explain it here. KNN will plot the data on an XY plane, with age on one axis and salary on the other (you could add a third axis, but two axes are enough to make the prediction). Of our three employees, the 21-year-old falls here, the 40-year-old falls here, and the 58-year-old falls somewhere here. What K-nearest neighbors does is allocate neighbors to each individual observation: this is observation one, this is observation two, this is observation three; observation one has no close neighbor, but two is the neighbor of three and three is the neighbor of two. Now suppose tomorrow a prediction request comes for, say, age 50 (again I will take 50 as the example) and, since salary is also a feature here, a salary of, say, 61K. KNN will try to see where this person with age 50 and salary 61K fits and who the nearest neighbors to that point are; say the new point lands here, with this record as the first neighbor and this one as the second. It then simply takes the mode of their results: among the neighbors' 0s and 1s, whichever is most frequent wins. In this tiny example there is no clear mode, but with larger data there will be: suppose there are 30 neighboring records, of which 20 are 1 and 10 are 0; then the prediction is whatever is the mode, in this case 1. So as I told you, KNN is a pretty simple algorithm: it plots your data, finds the nearest neighbors, and when a new observation comes, you specify how many neighbors to use and it predicts from those. Nothing complex in it, so I covered it quickly on the same slide.
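A quick sketch of the same toy problem with scikit-learn's KNeighborsClassifier, as my own illustration: the prediction for a new employee is the majority vote of the k nearest training records in the age/salary plane.

```python
# Sketch only.
from sklearn.neighbors import KNeighborsClassifier

X = [[21, 22_000], [40, 42_000], [58, 60_000]]
y = [1, 0, 1]

knn = KNeighborsClassifier(n_neighbors=2).fit(X, y)
# With only three rows and k=2 a tie is possible, as noted above;
# on real data you would pick an odd k.
print(knn.predict([[50, 61_000]]))
```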
Now let's try to understand another classification technique, known as Support Vector Machines, or SVMs.
What SVMs do is plot your data on whatever axes you have, say age on one axis and salary on the other. I will take a few more data points this time: these are some data points here, and these are some more data points there. SVM will try to create something known as a decision boundary. How is this decision boundary different from what linear regression does? In linear regression a pure mathematical line is fitted; here there is the concept of something known as a hyperplane. For example, if I draw a line between the two groups, all the black points you can think of as "leaves", target column 1, and all the other points as "does not leave", target column 0.
Suppose your data is like this. In the language of SVM, the line it draws is called a decision boundary. In this case your data looks pretty simple, well separated, so the decision boundary can be as simple as a line. But in most real-world scenarios the decision boundary cannot be this simple: there will be some black dots over here, some black circles over there, and some blue crosses on the other side. In that case a straight-line boundary is not doing justice, and the boundary needs to change; this is where hyperplanes and kernels come in, two very important concepts in SVM to explore further. When your data becomes complex, a simple decision boundary cannot separate it well, so you need a more complex boundary, and that is what the hyperplane and kernel concepts provide. But just to give you an idea of how SVM works: it creates a decision boundary, and when a new case comes tomorrow, for example a person with age 50 and salary 60K, and you ask whether the person will leave or not, the SVM model checks which side of the decision boundary that point falls on. If it falls on this side, the model says "does not leave"; if it falls on the other side, it says "leaves". So for SVM, remember the concepts of decision boundaries, hyperplanes, kernels, and kernel tricks.
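A minimal sketch with scikit-learn's SVC, as my own illustration: fit a decision boundary, then classify a new point by which side it falls on. kernel="linear" gives the simple straight-line boundary; kernel="rbf" is a common kernel-trick choice when the data is not linearly separable.

```python
# Sketch only.
from sklearn.svm import SVC

X = [[21, 22_000], [40, 42_000], [58, 60_000]]
y = [1, 0, 1]

svm = SVC(kernel="linear").fit(X, y)
print(svm.predict([[50, 60_000]]))   # which side of the boundary?
```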
So we have covered three algorithms from the classification scenarios and five from the regression scenarios; let's go ahead and look at some unsupervised learning problems. What is the meaning of unsupervised learning? Until now we had a target column, but in unsupervised learning we may not have a target column. Suppose, for the same employee data, we have age and salary, and somebody comes to you and asks: hey, can you tell me whether there are different buckets of employees existing in my organization? Different buckets means, for example, some people with less age and more salary, and some people with more age and less salary. Somebody can come and ask you whether such buckets exist.
You solve that problem using something known as clustering, or segmentation. So suppose the task at hand is this: here there are only three records, but in a real-world scenario there can be many more, and what I am interested in knowing is whether there are natural clusters in my organization. This is my organization's data, with age on one axis and salary on the other, and I have multiple data points; only three are real, but I am plotting more just for demonstration. There is nothing to predict, but the employer is interested in knowing whether there are buckets, that is, whether some employees are close to each other in terms of their characteristics. For example, these employees are close together, so you can call them bucket one; those employees are close together, so you can call them bucket two, or segment two. How will this be implemented? One technique for implementing this bucketing is K-means clustering; there can be other techniques for segmentation or bucketing as well, but K-means clustering is the one we'll use. In this technique, the distance between the various employees is computed. For example, this is your employee one and this is your employee two, and suppose I ask you how similar employee one is to employee two. There are different similarity metrics you can compute, for example Euclidean distance, Manhattan distance, or cosine similarity; I have detailed videos on these as well and will link them in the description. But take the simple one: to measure how similar or different these two employees are, you compute sqrt((21 − 40)² + (20,000 − 42,000)²); on every dimension you take the difference between them, square it, add them up, and take the square root. This is called the Euclidean distance between E1 and E2, whatever number it comes to.
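As a quick sanity check of that arithmetic, here is a tiny snippet (plain Python, my own illustration):

```python
# Euclidean distance between E1 (21, 20K) and E2 (40, 42K).
import math

e1, e2 = (21, 20_000), (40, 42_000)
d = math.sqrt((e1[0] - e2[0]) ** 2 + (e1[1] - e2[1]) ** 2)
print(d)   # ~22000: the salary axis dominates, which is why scaling matters
```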
Suppose the Euclidean distance between E1 and E2 is small and the Euclidean distance between E1 and E3 is large; in that case you say E1 and E2 are closer to each other. In the same way you keep finding employees who are close to each other, call one group one bucket, and call another group another bucket. Remember, I have explained this in simple terms, but there is a very important concept in K-means known as the centroid. Please go ahead and watch Unfold Data Science's detailed video on K-means clustering; you will understand all the details of how the centroid is defined and how the algorithm works at a mathematical level. I will link that video, please make sure you watch it.
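A short sketch of this bucketing with scikit-learn's KMeans, as my own illustration; the two extra rows are hypothetical "young, high salary" employees added so that two clusters actually exist.

```python
# Sketch only: bucket employees by distance to learned centroids.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = [[21, 20_000], [40, 42_000], [58, 60_000], [25, 90_000], [30, 85_000]]
X_scaled = StandardScaler().fit_transform(X)   # keep salary from dominating

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(km.labels_)            # bucket assignment per employee
print(km.cluster_centers_)   # centroids, in scaled units
```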
So that is K-means clustering. Now, last but not least, guys: you might have seen on Amazon and Flipkart that different products are recommended to you. For example, if you buy a laptop, it will tell you: hey, go ahead and buy this laptop bag as well. This is nothing but a recommendation. On Netflix, if you watch one action movie, say Mission Impossible, it will go and recommend the Jack Ryan series, maybe. This is called a recommendation system, running in the background. How does this system work? One simple yet powerful technique for recommender systems is known as collaborative filtering.
What collaborative filtering does is take users on one side and items on the other. Try to understand this simple concept; it's pretty easy. The users can be Aman, John, and Doe, and the items can be, say, Mission Impossible, Jack Ryan, a James Bond movie, Spider-Man, and a comedy movie, for example Home Alone. Which movies has Aman watched? He has watched Mission Impossible and Jack Ryan, but he has not watched the James Bond movie or Spider-Man, so I will mark those as zero. There is another guy, John, who has watched Mission Impossible, Jack Ryan, the James Bond movie, and the Spider-Man movie as well. And there is another guy, Doe, who has watched none of these movies but has watched Home Alone, the comedy. As a table:

| User | Mission Impossible | Jack Ryan | James Bond | Spider-Man | Home Alone |
|------|--------------------|-----------|------------|------------|------------|
| Aman | 1 | 1 | 0 | 0 | 0 |
| John | 1 | 1 | 1 | 1 | 0 |
| Doe  | 0 | 0 | 0 | 0 | 1 |

Which users are similar to which user is computed based on a user-similarity metric; as I told you, this can be cosine similarity or a different kind of distance metric. As common sense also tells you, Aman watches action movies and John watches action movies too, Mission Impossible and Jack Ryan, but Aman has not watched the James Bond movie or the Spider-Man movie. Since Aman's and John's tastes are similar, go ahead and recommend to Aman what John has watched but Aman has not: the James Bond movie and the Spider-Man movie. Now imagine this as a large matrix of many users and many items: the system finds which users' tastes are similar to each other, and then a user is recommended the movies or series they haven't watched, based on the similar users' watching history. This is the pretty simple but powerful technique known as collaborative filtering.
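A tiny sketch of user-based collaborative filtering with cosine similarity on the watched (1) / not watched (0) table above, as my own illustration:

```python
# Sketch only.
import numpy as np

items = ["Mission Impossible", "Jack Ryan", "James Bond",
         "Spider-Man", "Home Alone"]
R = np.array([
    [1, 1, 0, 0, 0],   # Aman
    [1, 1, 1, 1, 0],   # John
    [0, 0, 0, 0, 1],   # Doe
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Find Aman's most similar user, then recommend what that user
# watched but Aman has not.
sims = {u: cosine(R[0], R[u]) for u in (1, 2)}
best = max(sims, key=sims.get)
recs = [items[j] for j in range(len(items)) if R[best, j] and not R[0, j]]
print(recs)   # ['James Bond', 'Spider-Man']
```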
Let's revise once, guys, everything we discussed: a long discussion, but very fruitful for revising a few fundamental concepts. For regression we discussed linear regression, decision tree, Random Forest, AdaBoost, and Gradient Boost. For classification I explained logistic regression, how SVM works, and how KNN works. And I explained two unsupervised techniques: K-means and collaborative filtering. I did not go into too much detail, because it is not possible to cover all the details of 10 algorithms in a short time, but use this as a refresher, and please go ahead and click the links for whichever algorithm you are more interested in learning; all the videos are there on Unfold Data Science. I request you guys: please press the like button if you liked this video, and please press the subscribe button and the bell icon if you want me to create more videos like this. See you all in the next video, guys; wherever you are, stay safe and take care.
