Welcome back. I've started this video lecture series on reinforcement learning, and the last three videos were at a very high level: what is reinforcement learning, how does it work, and what are some of the applications. But we really didn't dig into many details of the actual algorithms, of how you implement reinforcement learning in practice. That's what I'm going to do today, and this next part of the series is something I hope is going to be really useful for all of you.

The first thing I'm going to do is organize the different approaches to reinforcement learning. This is a massive field that's about 100 years old. It merges neuroscience, behavioral science (think Pavlov's dog), optimization theory, and optimal control (think Bellman's equation and the Hamilton-Jacobi-Bellman equation), all the way to modern-day deep reinforcement learning, which is how to use powerful machine learning techniques to solve these optimization problems. You'll remember that in my view, reinforcement learning sits right at the intersection of machine learning and control theory: we are essentially machine-learning good, effective control strategies to interact with an environment.

So in this first lecture, what I'm going to do, and I'm hoping this is actually super useful for some of you, is talk through the organization of the different decisions you have to make, and how you can think about the landscape of reinforcement learning.

Before going on, I want to mention that this is actually a chapter in the new second edition of our book, Data-Driven Science and Engineering, with myself and Nathan Kutz; reinforcement learning was one of the new chapters I decided to write. This was a great excuse for me to learn more about reinforcement learning, and it's also a nice opportunity for me to communicate more details to you.

If you want to download this chapter, the link is here, and I'll also put it in the comments below. I'll have a link to the second edition of the book up soon as well, probably in the comments. So there's a new chapter you can follow along with, and each video roughly follows the chapter.

Good. Before I get into that organizational chart of how all of these different types of reinforcement learning can be thought of, I want to do a really quick recap of the reinforcement learning problem. In reinforcement learning, you have an agent that interacts with the world, or the environment, through a set of actions. Sometimes these are discrete actions, sometimes continuous: if I have a robot, I might have a continuous action space, whereas if I'm playing a game, say as the white pieces on a chess board, I have a discrete set of actions, even though that set might be high dimensional. I also observe the state of the system: at each time step, I get to observe the state and use that information to change my actions, to try to maximize my current and future rewards through playing.

I'll mention that in lots of applications, for example in chess, the reward structure might be quite sparse: I might not get any feedback on whether I'm making good moves until the very end, when I either win or lose. Tic-tac-toe, backgammon, checkers, and Go are all the same way, and that delayed reward structure is one of the things that makes the reinforcement learning problem really challenging. It's also what makes learning in animal systems challenging: if you want to teach your dog a trick, they have to learn step by step what you want them to do, so you sometimes have to give them rewards at intermediate steps to train a behavior.

The agent's control strategy, or policy, is typically called pi, and it is basically the probability of taking action a given the current state s. This could be a deterministic policy or a probabilistic policy, but essentially it's a set of rules that determines what actions I, as the agent, take, given what I sense in the environment, to maximize my future rewards. That's the policy, and it's usually written in a probabilistic framework, because typically the environment is written as a probabilistic model.

There is also something called a value function. Given some policy pi that I follow, I can associate a value with being in each state of the system: essentially, what is my expected future reward? Add up all of my future rewards and take the expectation, with a little discount factor, because future rewards might be less advantageous to me than current rewards. This is something people do in economic theory; it's kind of like a utility function. So for every policy pi, there is a value associated with being in each given state s.
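To make that concrete, here is the standard textbook form of the discounted value function; this is a sketch in generic notation, so the exact symbols in the chapter may differ slightly:

```latex
V_\pi(s) \;=\; \mathbb{E}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \;\middle|\; s_t = s \right], \qquad 0 < \gamma \le 1
```

Here r is the reward at each step, gamma is the discount factor, and the expectation is over the randomness in both the environment and the policy pi.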

Now, I'll point out that this only works directly for fairly simple problems. For tic-tac-toe, you can enumerate all possible states and all possible actions, and you can compute this value function through brute force. But even for moderately complicated games like checkers, let alone backgammon, chess, or Go, the state space, the space of all possible states you could observe your system in, is astronomically large. I think it's estimated that there are 10 to the 80 or more, maybe it's 10 to the 180 actually, possible chess boards, and even more possible Go boards. So you can't actually enumerate this value function, but it's a good abstract function to think about, along with the policy function. And at least in simple dynamic programming, we often assume that we know a model of our environment; I'll get to that in a minute.

The goal, the entire goal, of reinforcement learning is to learn, through trial and error, through experience, the optimal policy that maximizes your future rewards. Notice that this value function is a function of the policy pi: I want to learn the best possible policy, the one that always gets me the most value out of every board position, out of every state. And that's a really hard problem. The goal is easy to state and really hard to solve, which is why this has been a growing field for 100 years, and why it's still growing: we have ever more powerful techniques in machine learning to start to solve it.

So that's the framework. I need you to know what the policy is: the set of rules, or controllers, that I as an agent use to manipulate my environment. And the value function tells me how valuable it is to be in a particular state, so I might want to move myself into that state. With that nomenclature in hand, I can now show you how all of these techniques are organized. That's what I'm going to do for the rest of the video: talk through the key organization of the mainstream types of reinforcement learning.

Okay, so the first, biggest dichotomy is between model-based and model-free reinforcement learning. If you actually have a good model of your environment, some Markov decision process or some differential equation, then you can work in the model-based reinforcement learning world. Now, some people don't consider what I'm about to describe reinforcement learning, but I do. For example, suppose my environment is a Markov decision process, which means there is a specified probability, a deterministic probability, which sounds like an oxymoron, of moving from state s to the next state s' given action a, and this probability function is known. It doesn't depend on the history of your actions and states; only the current state and the current action determine the probability of going to the next state s'. Then two really powerful techniques that I'm going to tell you about for optimizing the policy function pi are policy iteration and value iteration. These let you iteratively walk through the game, or the Markov decision process, taking the actions you think are best, assessing what the value of that action and state actually is, and then refining and iterating the policy function and the value function.

That is a really powerful approach: if you have a model of your system, you can run this on a computer and learn what the best policy and value are. It's a special case of dynamic programming that relies on the Bellman optimality condition for the value function. I'm going to do a whole lecture on this blue part right here, where we'll talk about policy iteration and value iteration, and how they are essentially dynamic programming on the value function, which satisfies Bellman's optimality condition.

Now, that was for probabilistic processes, things like backgammon where there's a dice roll at every turn. For deterministic systems, like a robot or a self-driving car, the reinforcement learning problem is much more of a continuous control problem, in which case I might have some nonlinear differential equation, x dot equals f of x, u. The linear optimal control we studied, I believe in Chapter 8 of the book, so linear-quadratic regulators, Kalman filters, things like that, those optimal linear control problems are special cases of this optimal nonlinear control problem with the Hamilton-Jacobi-Bellman equation. Again, this relies on Bellman optimality, and you can use dynamic programming ideas to solve optimal nonlinear control problems like this.
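For reference, here is one common textbook form of the Hamilton-Jacobi-Bellman equation for those dynamics; this is a sketch with a generic running cost L, so the chapter's notation may differ:

```latex
-\frac{\partial V}{\partial t} \;=\; \min_{u} \left[ \, L(x, u) \;+\; \left(\nabla_x V\right)^{\!\top} f(x, u) \, \right],
\qquad \dot{x} = f(x, u)
```

Here V(x, t) is the optimal cost-to-go, and the optimal control is the minimizing u at each state.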

Now, I'll point out that mathematically this is a beautiful theory. It's powerful, it's been around for decades, and it's the textbook way of thinking about how to design optimal policies and optimal controllers for Markov decision processes and for nonlinear control systems. In practice, though, actually solving these things with dynamic programming usually amounts to a brute-force search, and it's usually not scalable to high-dimensional systems. Typically it's hard to do this optimal Hamilton-Jacobi-Bellman type of nonlinear control for even a moderately high-dimensional system. You can do it for a three-dimensional system, sometimes a five-dimensional system; I've heard of special cases where, with machine learning, you can do it for maybe a 10- or 100-dimensional system. But you can't do it for the nonlinear fluid flow equations, which might be a hundred-thousand- or million-dimensional differential equation when you write them down on your computer. So that's an important caveat. But that's model-based control, and a lot of what we're going to do in model-free control uses ideas we learned from model-based control.

Even though I don't actually do a lot of this model-based work in my daily reinforcement learning life, because most of the time we don't have a good model of our system. In chess, for example, I don't have a model of my opponent, or at least I can't write one down mathematically as a Markov decision process, so I can't really use these techniques. But a lot of what model-free reinforcement learning does is a kind of approximate dynamic programming, where you simultaneously learn the dynamics, or learn to update these functions, through trial and error, without actually having a model.

In model-free reinforcement learning, the major dichotomy is between gradient-free and gradient-based methods. I'll say more about what this means in a bit, but for example, if I can parameterize my policy pi by some variables theta, and I know the dependence on those variables, I might be able to take the gradient of my reward function or my value function with respect to those parameters directly, and speed up the optimization. Gradient-based methods, if you can use them, are usually the fastest, most efficient way to do things. But often we don't have gradient information: we're just playing games, we're playing chess, we're playing Go, and I can't compute the derivative of one game with respect to another; that's hard for me, at least, to do.

Within gradient-free methods there are more splits; there's a dichotomy of a dichotomy of a dichotomy here. Within gradient-free control there is the idea that you can be either on-policy or off-policy, and it's a really important distinction. On-policy means the following. Say I'm playing a bunch of games of chess, trying to learn an optimal policy function or an optimal value function, or both, by playing games and iteratively refining my estimate of pi or V. On-policy means I always play my best possible game: whatever I think the value function is, and whatever I think my best policy is, I always use that best policy as I play, always trying to get the most reward out of every game. That's what it means to be on-policy.

Off-policy means that maybe I'll try some other things. Maybe I know my policy is suboptimal, so I'll just make some random moves occasionally. That is called off-policy, because I believe those moves are suboptimal, but they might be really valuable for learning information about the system.
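One simple way to implement that "occasionally make random moves" idea is an epsilon-greedy action-selection rule. This is a minimal sketch of the generic technique, not code from the book:

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1, rng=None):
    """With probability epsilon take a random exploratory action;
    otherwise take the greedy action under the current estimates.

    Q: (nS, nA) array of action-value estimates.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore: random move
    return int(np.argmax(Q[state]))           # exploit: best known move
```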

On-policy methods include SARSA, which stands for state-action-reward-state-action, and there are all of these variants of the SARSA algorithm for on-policy reinforcement learning. TD means temporal difference, and MC is Monte Carlo, so there's a whole family of gradient-free optimization techniques that use different amounts of history. I'll talk all about that in a whole other lecture on this red box: gradient-free, model-free reinforcement learning.
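The core one-step SARSA update, sketched here in standard notation (the chapter's symbols may differ slightly), is:

```latex
Q(s_t, a_t) \;\leftarrow\; Q(s_t, a_t) \;+\; \alpha \left[ \, r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \, \right]
```

Here alpha is a learning rate, and a_{t+1} is the action the current policy actually takes next; using that actual next action is what makes SARSA on-policy.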

For this on-policy set of algorithms, there is an off-policy variant called Q-learning. This quality function Q is, if you like, the joint value of being in a particular state s and taking a particular action a. The quality function contains all of the information of my optimal policy and the value function; both can be derived from it. But the really important distinction is that when we learn based on the quality function, we don't need a model for what my next state is going to be: the quality function implicitly encodes the value of where you're going to go in the future. So Q-learning is a really nice way of learning when you have no model, and you can take off-policy information and learn from it. You can run a suboptimal controller just to see what happens, and still learn and get better policies and better value functions in the future.

That's also really important if you want to do imitation learning. If I just want to watch other people play games of chess, even though I don't know their value function or their policy, with these off-policy learning algorithms I can accumulate that information into my estimate of the world, and every bit of information improves my quality function and improves the next game I'm going to play. So it's really powerful, and I would say most of what we do nowadays is in this Q-learning world; a whole lot of reinforcement learning is Q-learning.
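For contrast with SARSA, here is a minimal tabular Q-learning update; again, a sketch of the generic algorithm with variable names of my own choosing, not code from the book:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update on a (nS, nA) numpy array Q.

    Off-policy: the target bootstraps from the max over next actions
    (the greedy policy), regardless of what the agent actually does next.
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```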

Then there are the gradient-based algorithms. I'm not going to talk about them much here, but essentially you update the parameters of your policy, your value function, or your Q function directly, using some kind of gradient optimization. If I can sum up all of my future rewards as a function of the parameters theta that parameterize my policy, then I might be able to use gradient optimization, things like Newton steps and steepest descent, to get a good estimate. When I have the ability to do that, it's going to be way faster than any of these gradient-free methods, and in turn faster than dynamic programming.
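The canonical example of this idea is the policy gradient theorem behind algorithms like REINFORCE; here is the standard form, sketched in generic notation:

```latex
\nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\!\left[ \, \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; G_t \, \right]
```

Here J(theta) is the expected total reward and G_t is the discounted return following time t; the gradient can be estimated from sampled episodes without a model of the environment.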

The last piece of this is that in the last 10 years we've had a massive explosion of deep reinforcement learning. A lot of this has been because of DeepMind and AlphaGo demonstrating that computers can play Atari games at human-level performance and can beat grandmasters at Go: just incredibly impressive demonstrations of reinforcement learning. These methods use deep neural networks either to learn a model, which you can then use with model-based reinforcement learning, or to represent the model-free concepts. You can have a deep neural network for the quality function, or a deep neural network for the policy, and then differentiate with respect to those network parameters, using automatic differentiation and backpropagation, to do gradient-based optimization on your policy network.
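As a tiny illustration of what "a deep neural network for the policy" can mean, here is a minimal PyTorch sketch; the layer sizes, placeholder data, and REINFORCE-style loss are my own illustrative choices, not from the video:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Small policy network: maps a state vector to action probabilities."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

# Backpropagation through the policy: a REINFORCE-style loss on one sampled action.
policy = PolicyNet(state_dim=4, n_actions=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(4)                    # placeholder state observation
dist = torch.distributions.Categorical(policy(state))
action = dist.sample()
reward = 1.0                              # placeholder return for the sampled action
loss = -dist.log_prob(action) * reward    # ascend the expected reward
optimizer.zero_grad()
loss.backward()
optimizer.step()
```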

I would say that deep model predictive control doesn't exactly fit into the reinforcement learning world, but it's morally very closely related. Deep model predictive control lets you solve these hard optimal nonlinear problems, and then you can learn a policy based on what your model predictive controller actually does; you can essentially codify that model predictive controller into a control policy. And finally, actor-critic methods: these existed long before deep reinforcement learning, but nowadays there is renewed interest in them because you can train them with deep neural networks.

Okay, so that is the mile-high view, as I see it, of the different categories of reinforcement learning. Is it comprehensive? Absolutely not. Is it one hundred percent factually correct? Definitely not. It's a rough sketch of the main divides and the things you need to think about when choosing a reinforcement learning algorithm. If you have a model of your system, you can use dynamic programming based on Bellman optimality. If you don't have a model, you can use either gradient-free or gradient-based methods, and then there are on-policy and off-policy variants depending on your specific needs; it turns out that SARSA methods are more conservative, while Q-learning tends to converge faster. And for all of these methods, there are ways of making them more powerful, with more flexible representations, using deep neural networks in different, focused ways.

In the next few videos we'll zoom into each part: for Markov decision processes, how we do policy iteration and value iteration, and we'll actually derive the quality function.

We'll talk about model-free control, these gradient-free methods, on-policy and off-policy; Q-learning is one of the most important, and temporal difference learning actually has a lot of neuroscience analogs, since how we learn in our animal brains is thought to be closely related to these TD learning rules. We'll talk about how you do optimal nonlinear control with the Hamilton-Jacobi-Bellman equation, and very briefly about policy gradient optimization. And for all of these there are deep learning counterparts; we'll pepper deep learning throughout, or maybe I'll have a whole lecture on these deep learning methods. So that's all coming up. I'm really excited to walk you through this. Thank you.

This video introduces the variety of methods for model-based and model-free reinforcement learning, including dynamic programming, value and policy iteration, Q-learning, deep RL, TD-learning, SARSA, and policy gradient optimization, among others.

Citable link for this video:

This is the overview in a series on reinforcement learning, following the new Chapter 11 from the 2nd edition of our book “Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control” by Brunton and Kutz

Book Website:

Book PDF:

RL Chapter:

Amazon:

Brunton Website: eigensteve.com

0:00 Intro

3:00 Background

7:54 Model-Based & Model-Free Reinforcement Learning (RL)

8:29 Markov Decision Process (MDP)

10:25 Nonlinear Dynamics

13:02 Gradient-Based & Gradient-Free RL

14:05 Off-Policy (Q Learning) & On-Policy (SARSA) RL

17:23 Policy Gradient Optimization

18:05 Deep RL
