
Reinforcement Learning Series: Overview of Methods


Welcome back. I've started this video lecture series on reinforcement learning, and the last three videos were at a very high level: what is reinforcement learning, how does it work, and what are some of the applications. But we didn't really dig into many details of the actual algorithms, of how you implement reinforcement learning in practice. That's what I'm going to do today and in this next part of the series, and I hope it's going to be really useful for all of you.

The first thing I'm going to do is organize the different approaches to reinforcement learning. This is a massive field that's about 100 years old. It merges neuroscience and behavioral science (think Pavlov's dog), optimization theory, and optimal control (think Bellman's equation and the Hamilton-Jacobi-Bellman equation), all the way up to modern deep reinforcement learning, which uses powerful machine learning techniques to solve these optimization problems. You'll remember that, in my view, reinforcement learning sits at the intersection of machine learning and control theory: we are essentially machine learning good, effective control strategies for interacting with an environment.

So in this first lecture, and I'm hoping this is actually super useful for some of you, I'm going to talk through the organization of the different decisions you have to make and how you can think about the landscape of reinforcement learning.

Before going on, I want to mention that this is actually a chapter in the new second edition of our book, Data-Driven Science and Engineering, with myself and Nathan Kutz; reinforcement learning was one of the new chapters I decided to write. It was a great excuse for me to learn more about reinforcement learning, and it's also a nice opportunity to communicate more of the details to you. If you want to download the chapter, the link is here, and I'll also put it in the comments below, along with a link to the second edition of the book soon. So there is a new chapter you can follow along with, and each video roughly follows the chapter.

Good. Before I get into the organizational chart of how all of these different types of reinforcement learning can be thought of, I want to do a really quick recap of what the reinforcement learning problem is.

In reinforcement learning, you have an agent that gets to interact with the world, or the environment, through a set of actions. Sometimes these are discrete actions and sometimes they are continuous: if I have a robot, I might have a continuous action space, whereas if I'm playing a game, say as the white pieces on a chess board, I have a discrete set of actions, even though it might be high dimensional. At each time step I get to observe the state of the system, and I use that information to change my actions to try to maximize my current or future rewards through playing.

I'll mention that in lots of applications, for example in chess, the reward structure might be quite sparse: I might not get any feedback on whether I'm making good moves until the very end, when I either win or lose. Tic-tac-toe, backgammon, checkers, and Go are all the same way, and that delayed reward structure is one of the things that makes the reinforcement learning problem really challenging. It's what makes learning in animal systems challenging too: if you want to teach your dog a trick, they have to know step by step what you want them to do, so you sometimes have to give them rewards at intermediate steps to train a behavior.

The agent's control strategy, or policy, is typically called pi, and it is basically the probability of taking action a given the current state s. It could be a deterministic policy or a probabilistic policy, but essentially it's a set of rules that determines what actions I, as the agent, take, given what I sense in the environment, to maximize my future rewards. That's the policy, and it's usually written in a probabilistic framework because the environment is typically described by a probabilistic model.

There is also something called a value function. Given some policy pi, I can associate a value with being in each state of the system: essentially, what is my expected future reward? Add up all of my future rewards and take the expectation, with a little discount factor included because future rewards may be less advantageous to me than current rewards. That's something people do in economic theory; it's kind of like a utility function. So for every policy pi, there is a value associated with being in each of the states s.
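To fix notation, here is one standard way to write the policy and the value function just described (generic textbook notation with discount factor gamma; the symbols in the book chapter may differ slightly):

```latex
% Policy: probability of taking action a when the current state is s
\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)

% Value of state s under policy pi: expected discounted sum of future rewards,
% with discount factor 0 < \gamma \le 1
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; s_t = s \right]
```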

Now, I'll point out that for simple problems you can actually do this: for tic-tac-toe you can enumerate all possible states and all possible actions and compute the value function through brute force. But even for moderately complicated games like checkers, let alone backgammon, chess, or Go, the state space, the space of all possible states you could observe the system in, is astronomically large. I think it's estimated that there are 10 to the 80-plus, maybe 10 to the 180, possible chess boards; it's a huge number, and there are even more possible Go boards. So you can't really enumerate this value function, but it's a good abstract object to think about, along with the policy function. At least in simple dynamic programming, we often assume that we know a model of our environment, and I'll get to that in a minute.

The goal, the entire goal of reinforcement learning, is to learn, through trial and error and through experience, the optimal policy that maximizes your future rewards. Notice that the value function is a function of the policy pi; I want to learn the best possible policy, the one that always gives me the most value out of every board position, out of every state. And that's a really hard problem. It's easy to state the goal, and it is really hard to solve. That's why this has been a growing field for 100 years, and why it's still growing: we keep getting more powerful techniques in machine learning to attack the problem.

So that's the framework. I need you to know what the policy is, the set of rules or controllers that I as an agent use to manipulate my environment, and what the value function is, which tells me how valuable it is to be in a particular state, so that I might want to move myself into that state. With that nomenclature, I can now show you how all of these techniques are organized. That's what I'm going to do for the rest of the video: talk through the key organization of the mainstream types of reinforcement learning.

Okay, so the first and biggest dichotomy is between model-based and model-free reinforcement learning. If you actually have a good model of your environment, some Markov decision process or some differential equation, then you can work in the model-based reinforcement learning world. Now, some people don't consider what I'm about to describe reinforcement learning, but I do. For example, suppose my environment is a Markov decision process, which means there is a specified probability of moving from state s to the next state s' given action a, and this probability function is known. It doesn't depend on the history of your actions and states; the current state and the current action alone determine the probability of going to the next state s'. In that case, two really powerful techniques for optimizing the policy pi, which I'm going to tell you about, are policy iteration and value iteration. These let you iteratively walk through the game, or the Markov decision process, taking the actions you think are best, assessing what the value of that action and state actually is, and then refining and iterating the policy function and the value function.

That is a really powerful approach: if you have a model of your system, you can run this on a computer and learn what the best policy and value are. It is essentially a special case of dynamic programming that relies on the Bellman optimality condition for the value function. I'm going to do a whole lecture on this blue part right here, where we'll talk about policy iteration and value iteration and how they are essentially dynamic programming on the value function, which satisfies Bellman's optimality condition.
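To make the value-iteration idea concrete, here is a minimal sketch for a finite MDP whose transition probabilities and rewards are known and stored as arrays; the array names and shapes are illustrative, not taken from the lecture or the chapter:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Minimal value iteration for a finite MDP.

    P: array of shape (S, A, S); P[s, a, s2] is the probability of moving
       from state s to state s2 when taking action a.
    R: array of shape (S, A); expected immediate reward for action a in state s.
    Returns the converged value function and the greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        # Bellman optimality backup:
        # V(s) <- max_a [ R(s,a) + gamma * sum_s2 P(s2|s,a) V(s2) ]
        Q = R + gamma * (P @ V)        # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    policy = Q.argmax(axis=1)          # greedy policy with respect to V
    return V, policy
```

Policy iteration works with the same ingredients but alternates between evaluating the current policy and greedily improving it.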

Now, that was for probabilistic processes, things like backgammon where there's a dice roll at every turn. For deterministic systems, like a robot or a self-driving car, the reinforcement learning problem is much more of a continuous control problem, where I might have some nonlinear differential equation x-dot = f(x, u). The linear optimal control that we studied, I believe in Chapter 8 of the book, things like linear quadratic regulators and Kalman filters, those optimal linear control problems are special cases of this optimal nonlinear control problem with the Hamilton-Jacobi-Bellman equation. Again, this relies on Bellman optimality, and you can use dynamic programming ideas to solve optimal nonlinear control problems like this.
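For reference, the continuous-time problem described here is usually written something like this (a standard infinite-horizon textbook form; the exact statement in the chapter may differ):

```latex
% Nonlinear dynamics and a cost functional to be minimized
\dot{x} = f(x, u), \qquad
J(x_0) = \int_{0}^{\infty} \mathcal{L}\big(x(\tau), u(\tau)\big)\, d\tau

% Hamilton-Jacobi-Bellman equation for the optimal value function V(x)
0 = \min_{u} \Big[ \mathcal{L}(x, u) + \big(\nabla_x V(x)\big)^{\top} f(x, u) \Big]
```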

Mathematically, this is a beautiful theory. It's powerful, it's been around for decades, and it's the textbook way of thinking about how to design optimal policies and optimal controllers for Markov decision processes and for nonlinear control systems. In practice, though, actually solving these problems with dynamic programming usually amounts to a brute-force search, and it's usually not scalable to high-dimensional systems. Typically it's hard to do this optimal Hamilton-Jacobi-Bellman type of nonlinear control for even a moderately high-dimensional system. You can do it for a three-dimensional system, sometimes a five-dimensional system; I've heard of special cases where, with machine learning, you can maybe do it for a 10- or 100-dimensional system. But you can't do it for the nonlinear fluid flow equations, which might be a hundred-thousand- or million-dimensional differential equation when you write it down on your computer. So that's an important caveat. Still, that's model-based control, and a lot of what we're going to do in model-free control uses ideas that we learned from model-based control.

Even so, I don't actually use these model-based techniques much in my daily life with reinforcement learning, because most of the time we don't have a good model of the system. In chess, for example, I don't have a model of my opponent, or at least I can't write one down mathematically as a Markov decision process, so I can't really use these techniques directly. But a lot of what model-free reinforcement learning does is essentially approximate dynamic programming, where you learn to update these functions through trial and error without actually having a model.

Within model-free reinforcement learning, the major dichotomy is between gradient-free and gradient-based methods. I'll explain what this means in a bit, but for example, if I can parameterize my policy pi by some variables theta, and I know how the policy depends on those variables, I might be able to take the gradient of my reward function or my value function with respect to those parameters directly and speed up the optimization. Gradient-based methods, when you can use them, are usually the fastest and most efficient way to do things. But oftentimes we don't have gradient information: we're just playing games, chess or Go, and I can't compute the derivative of one game with respect to another; that's hard for me, at least, to do.

Within gradient-free methods, and there are dichotomies within dichotomies here, there is the idea of being on-policy or off-policy, and it's a really important distinction. On-policy means the following: say I'm playing a bunch of games of chess, trying to learn an optimal policy function or an optimal value function (or both) by playing games and iteratively refining my estimates of pi or V. On-policy means I always play my best game possible: whatever I currently think the value function is, and whatever I think my best policy is, I always use that best policy as I play, and I always try to get the most reward out of every game. Off-policy means that maybe I'll try some things I know are suboptimal, for instance occasionally making random moves. Those moves are called off-policy because I think they're suboptimal, but they might be really valuable for learning information about the system.

On-policy methods include SARSA (state-action-reward-state-action), and there are many variants of the SARSA algorithm. Here TD means temporal difference and MC means Monte Carlo, and these form a whole family of gradient-free techniques that use different amounts of history. I'll talk all about that in another whole lecture on this red box: gradient-free, model-free reinforcement learning.
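For reference, here are the basic one-step on-policy updates in standard textbook form, where alpha is a learning rate and (s, a, r, s', a') is a transition generated by the current policy:

```latex
% SARSA: on-policy temporal-difference update of the quality function
Q(s, a) \leftarrow Q(s, a) + \alpha \big[\, r + \gamma\, Q(s', a') - Q(s, a) \,\big]

% TD(0): the analogous update for the value function of the current policy
V(s) \leftarrow V(s) + \alpha \big[\, r + \gamma\, V(s') - V(s) \,\big]
```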

The off-policy counterpart of SARSA and this on-policy family is called Q-learning. The quality function Q is, if you like, the joint value of being in a particular state s and taking a particular action a. This quality function contains all the information of the optimal policy and the value function; both can be derived from it. The really important distinction is that when we learn based on the quality function, we don't need a model of what the next state is going to be: the quality function implicitly encodes the value of where you're going to go in the future. So Q-learning is a really nice way of learning when you have no model, and you can take off-policy information and learn from it. You can run a suboptimal controller just to see what happens and still learn better policies and better value functions for the future. That's also really important if you want to do imitation learning: if I just watch other people play chess, even though I don't know their value function or their policy, with these off-policy learning algorithms I can accumulate that information into my estimate of the world, and every bit of information improves my quality function and improves the next game I play. So it's really powerful, and I would say most of what we do nowadays is in this Q-learning world; a lot of reinforcement learning is Q-learning.
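Here is a minimal tabular Q-learning sketch showing how the update and the off-policy exploration fit together; the env object (with reset() and step(a)) is a hypothetical stand-in for whatever game or simulator you are learning from, not an API from the lecture:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy may explore (off-policy): epsilon-greedy here.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy target bootstraps with the greedy action via max,
            # regardless of what the behavior policy actually does next.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    # Both the value function and a greedy policy fall out of Q:
    V = Q.max(axis=1)
    policy = Q.argmax(axis=1)
    return Q, V, policy
```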

Then there are the gradient-based algorithms. I'm not going to talk about them much here, but essentially you update the parameters of your policy, your value function, or your Q function directly using some kind of gradient optimization. If I can write the sum of my future rewards as a function of the parameters theta that parameterize my policy, then I may be able to use gradient optimization, things like Newton steps and steepest descent, to get a good estimate. When you have the ability to do this, it is going to be far faster than any of the gradient-free methods, and in turn faster than dynamic programming.
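The basic policy-gradient idea can be written as follows (the standard REINFORCE form, stated generically rather than quoted from the chapter); here R_t is the discounted return following time t and eta is a step size:

```latex
% Expected return as a function of the policy parameters theta
J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t} \gamma^{t} r_{t}\right]

% Policy-gradient estimator, followed by a gradient-ascent step on theta
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, R_t\right],
\qquad \theta \leftarrow \theta + \eta\, \nabla_{\theta} J(\theta)
```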

The last piece of this is that, in the last ten years, we've had a massive explosion of deep reinforcement learning. A lot of this has been because of DeepMind and AlphaGo: demonstrations that computers can play Atari games at human-level performance and beat grandmasters at Go, just incredibly impressive results. These systems use deep neural networks either to learn a model, which you can then use with model-based reinforcement learning, or to represent the model-free quantities: you can have a deep neural network for the quality function or a deep neural network for the policy, and then differentiate with respect to the network parameters, using automatic differentiation and backpropagation, to do gradient-based optimization on your policy network.
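As a small illustration of what a deep neural network for the quality function, trained with autodiff and backpropagation, might look like, here is a minimal PyTorch sketch; the layer sizes and single-transition update are placeholders for illustration (a real deep Q-learning setup would also use a replay buffer and a target network):

```python
import torch
import torch.nn as nn

# A small Q-network: maps a state vector to one Q-value per discrete action.
# The sizes (4 state dimensions, 2 actions) are arbitrary placeholders.
q_net = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_step(s, a, r, s_next, gamma=0.99):
    """One gradient step on the temporal-difference error for a single
    transition (s, a, r, s_next); s and s_next are 1-D state tensors."""
    q_sa = q_net(s)[a]                            # Q(s, a)
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max()  # r + gamma * max_a' Q(s', a')
    loss = (q_sa - target) ** 2
    optimizer.zero_grad()
    loss.backward()                               # autodiff / backpropagation
    optimizer.step()
    return loss.item()
```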

I would also mention deep model predictive control. It doesn't exactly fit into the reinforcement learning world, but it's closely related in spirit: deep model predictive control lets you solve these hard optimal nonlinear problems, and then you can learn a policy based on what the model predictive controller actually does, essentially codifying that controller into a control policy. Finally, there are actor-critic methods, which existed long before deep reinforcement learning but have seen renewed interest because you can now train them with deep neural networks.

Okay, so that is the mile-high view, as I see it, of the different categories of reinforcement learning. Is it comprehensive? Absolutely not. Is it one hundred percent factually correct? Definitely not. It's a rough sketch of the main divides and the things you need to think about when choosing a reinforcement learning algorithm. If you have a model of your system, you can use dynamic programming based on Bellman optimality. If you don't have a model, you can use either gradient-free or gradient-based methods, and then there are on-policy and off-policy variants depending on your specific needs; it turns out that SARSA methods tend to be more conservative, while Q-learning tends to converge faster. And for all of these methods there are ways of making them more powerful, with more flexible representations, using deep neural networks in different, focused ways.

In the next few videos we'll zoom into the part on Markov decision processes: how we do policy iteration and value iteration, and we'll actually derive the quality function there. We'll talk about model-free control, the gradient-free methods, on-policy and off-policy; Q-learning is one of the most important ones, and temporal difference learning actually has a lot of neuroscience analogs, so how we learn in our animal brains is thought to be closely related to these TD learning policies. We'll talk about how you do optimal nonlinear control with the Hamilton-Jacobi-Bellman equation, and we'll talk briefly about policy gradient optimization. And for all of these there are deep learning versions; I'll pepper deep learning throughout, or maybe have a whole lecture on these deep learning methods. That's all coming up; I'm really excited to walk you through this. Thank you.
This video introduces the variety of methods for model-based and model-free reinforcement learning, including: dynamic programming, value and policy iteration, Q-learning, deep RL, TD-learning, SARSA, policy gradient optimization, among others.

Citable link for this video:

This is the overview in a series on reinforcement learning, following the new Chapter 11 from the 2nd edition of our book “Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control” by Brunton and Kutz

Book Website:
Book PDF:
RL Chapter:

Amazon:

Brunton Website: eigensteve.com


20 COMMENTS

  1. I have energy data and I need to implement RL on this data (from an inverter) to achieve the best result (when to charge/discharge the battery, when the best time is to feed into the grid, etc.). Which algorithm should I use for that?

  2. Superb. One of the things I always struggle with when learning something is having a well-structured map in my head of the topic and subtopics, and this does an extremely good job of that. Many thanks.
