Reinforcement learning deals with general learning problems in which an Agent seeks to achieve some goal in an environment formalized as a Markov Decision Process (or, in rare cases, a Multi-Armed Bandit or Contextual Bandit). This system can be broken down into four key components (a minimal code sketch of all four follows the list):
- A policy $\pi$ defines how the agent behaves by specifying what the agent should do in response to an observation from the environment.
- A reward signal defines our goal for the agent by designating what states or actions are desirable. Our agent aims to maximize its expected total reward; that is, for some parameters $\theta$ that define our policy $\pi_\theta$, our objective is to maximize $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_t R_t\big]$.
- Value functions define the expected total reward from a state or state-action pair. Thus, these functions capture the long-term value of that state or state-action pair.
- A model of the environment estimates how the environment behaves. This component is optional, but learning it allows our agent to perform planning.
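To make these pieces concrete, here is a minimal sketch that wires all four together on a toy two-state MDP; the environment, its `step` helper, and all constants are assumptions made purely for illustration.

```python
import random

# Toy 2-state MDP; everything here is a hypothetical stand-in for illustration.
STATES, ACTIONS = [0, 1], [0, 1]

def step(state, action):
    """Environment: returns (next_state, reward) for a state-action pair."""
    next_state = action                  # the action deterministically picks the next state
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def policy(state):
    """Policy: maps an observation to an action (here, uniformly random)."""
    return random.choice(ACTIONS)

value = {s: 0.0 for s in STATES}         # value function: long-term value per state
model = {}                                # optional model: (s, a) -> observed (s', r)

state, total_reward, gamma, alpha = 0, 0.0, 0.9, 0.1
for t in range(100):
    action = policy(state)
    next_state, reward = step(state, action)        # reward signal defines the goal
    model[(state, action)] = (next_state, reward)   # learning a model enables planning
    # TD(0)-style update: nudge the value toward reward + discounted next-state value
    value[state] += alpha * (reward + gamma * value[next_state] - value[state])
    total_reward += reward
    state = next_state

print(total_reward, value)
```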
There's a huge diversity of algorithms that balance stability, efficiency, and prior assumptions. Depending on the environment, specifically the size of the state space, these methods are either tabular or use function approximation; for the former, we simply maintain values for each state or state-action pair, whereas in the latter, due to a combinatorially large state space, we need to generalize our findings by using a function (like a Multilayer Perceptron or Linear Function Approximation) instead.
Generally, if we have a table or function $\hat{v}$, a tabular update nudges a single entry toward a target,
$$\hat{v}(S_t) \leftarrow \hat{v}(S_t) + \alpha \big[ \text{target} - \hat{v}(S_t) \big],$$
whereas a function update uses Gradient Descent on the weights,
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \big[ \text{target} - \hat{v}(S_t, \mathbf{w}) \big] \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}).$$
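As a rough illustration of the two update styles, the sketch below applies the same TD(0)-style target to a table entry and to the weights of a linear approximator; the feature map and all sizes are assumptions, not part of any particular algorithm's specification.

```python
import numpy as np

alpha, gamma = 0.1, 0.9
s, r, s_next = 0, 1.0, 1

# Tabular: one entry per state, nudged directly toward the target.
V = np.zeros(5)
target = r + gamma * V[s_next]
V[s] += alpha * (target - V[s])

# Function approximation: v_hat(s) = w . x(s); a (semi-)gradient step moves the
# weights instead of a table entry, so similar states share what is learned.
def x(state):                          # hypothetical feature map
    feats = np.zeros(3)
    feats[state % 3] = 1.0
    return feats

w = np.zeros(3)
v_hat = lambda state: w @ x(state)
target = r + gamma * v_hat(s_next)
w += alpha * (target - v_hat(s)) * x(s)    # semi-gradient TD(0) step
```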
Nevertheless, whether tabular or approximate, we can generally categorize these methods as learning or planning, depending on whether they're model-free or model-based respectively.
Planning (Model-Based)
Planning methods use a model of the environment's dynamics, $p(s', r \mid s, a)$, to improve the policy or its value estimates.
Background Planning
Background planning methods exploit the known transition probabilities to accurately estimate expectations over the environment's dynamics.
- Dynamic Programming is a classic approach to estimating value functions using the environment dynamics and bootstrapping. The two most common implementations are Policy Iteration and Value Iteration (a value-iteration sketch follows this list).
- Model Predictive Control and Dyna are methods that execute both model-learning and planning updates simultaneously.
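As a concrete example of background planning, here is a minimal value-iteration sketch over a small hand-written dynamics table; the MDP, its `(prob, next_state, reward)` layout, and the constants are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP: P[s][a] lists (prob, next_state, reward) tuples.
gamma, theta = 0.9, 1e-6
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},   # absorbing end state
}

V = np.zeros(len(P))
while True:
    delta = 0.0
    for s in P:
        # Back up each state with the best one-step lookahead over actions.
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]]
        best = max(q)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

# Greedy policy extracted from the converged values.
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
          for s in P}
print(V, policy)
```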
Decision-Time Planning
Decision-time planning supplements the policy by using the dynamics to choose an action $A_t$ at the current state $S_t$.
- Heuristic Search recursively checks the game tree and computes action-values using the dynamics probabilities.
- Rollout methods simulate possible future trajectories with a rollout policy and estimate action-values via sampling (a sketch follows this list).
- Monte Carlo Tree Search performs multiple rollouts directed toward trajectories that it estimates to have high value.
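To illustrate decision-time planning, the sketch below estimates each action's value at the current state by sampling rollouts from an assumed `simulate` model and then acts greedily; the dynamics and the uniform rollout policy are placeholders, not a prescribed setup.

```python
import random

ACTIONS, gamma = [0, 1], 0.95

def simulate(state, action):
    """Hypothetical model of the dynamics: returns (next_state, reward, done)."""
    next_state = (state + action) % 5
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

def rollout_value(state, action, depth=20, n=50):
    """Average discounted return over n simulated trajectories starting with `action`."""
    total = 0.0
    for _ in range(n):
        s, a, ret, discount = state, action, 0.0, 1.0
        for _ in range(depth):
            s, r, done = simulate(s, a)
            ret += discount * r
            discount *= gamma
            if done:
                break
            a = random.choice(ACTIONS)       # rollout policy: uniform random
        total += ret
    return total / n

current_state = 0
best_action = max(ACTIONS, key=lambda a: rollout_value(current_state, a))
print(best_action)
```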
Learning (Model-Free)
Learning methods derive a policy directly from interacting with the environment without using the dynamics. There are a variety of approaches; these algorithms can generally be grouped into action-value methods, policy gradients, and imitation learning.
Action-Value Methods
Action-value methods estimate the state-value and action-value functions and derive a policy indirectly.
- Monte Carlo Control is an approximation of dynamic programming using only samples from the environment.
- Temporal Difference Learning combines dynamic programming and Monte Carlo control by learning from samples and bootstrapping updates. The most common instances of control are Sarsa (on-policy) and Q-Learning (off-policy); see the Q-learning sketch after this list.
- N-Step Bootstrapping bridges TD learning and Monte Carlo estimates by balancing the weight on samples versus bootstrap estimates.
- Eligibility Traces generalizes $n$-step bootstrapping to average over all $n$.
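As a concrete instance of off-policy TD control, here is a tabular Q-learning sketch on a hypothetical corridor environment; the environment, episode cap, and hyperparameters are assumptions chosen only to keep the example small.

```python
import random
from collections import defaultdict

# Corridor of N states; moving right from the last-but-one state yields reward 1.
N, ACTIONS = 5, [-1, +1]
alpha, gamma, eps = 0.1, 0.95, 0.3
Q = defaultdict(float)                 # Q[(state, action)]

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    r = 1.0 if s2 == N - 1 else 0.0
    return s2, r, s2 == N - 1

for episode in range(300):
    s = 0
    for t in range(200):               # cap episode length
        # epsilon-greedy behavior policy
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s2, r, done = step(s, a)
        # off-policy target: bootstrap from the greedy action at the next state
        best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if done:
            break

print({s: max(ACTIONS, key=lambda a_: Q[(s, a_)]) for s in range(N)})
```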
Policy Gradients
Policy Gradients optimize the policy directly rather than going through the value functions. These are most commonly used with function approximation, since approximation ensures that similar states get similar actions. Some common policy gradient methods are below:
- Monte Carlo Policy Gradient estimates a baselined return through samples (sketched after this list).
- Off-Policy Policy Gradient computes gradients with samples drawn from another distribution.
- Actor-Critic uses an action-value function estimate and updates both the policy and value function simultaneously.
- Off-Policy Actor-Critic modifies actor-critic to work with off-policy samples from a replay buffer.
- A3C applies actor-critic updates asynchronously across parallel workers to stabilize training.
- Deterministic Policy Gradients and DDPG optimize a deterministic policy for Q-learning-like off-policy updates.
- TD3 improves DDPG training stability with more conservative value estimates.
- Soft Actor-Critic trains an off-policy actor-critic through Entropy Regularization.
- Natural Policy Gradient observes that policy updates change the future state-action data distribution and adds a distribution-change constraint to the gradient problem.
- Trust Region Policy Optimization improves the stability and efficiency of the natural policy gradient using the conjugate gradient method and a line search for improvement.
- Proximal Policy Optimization simplifies TRPO by incorporating the distribution constraint directly into a clipped objective.
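To ground the Monte Carlo policy gradient idea, the sketch below runs a REINFORCE-style update with a running-average baseline on an assumed two-armed bandit; the reward means, learning rates, and softmax parameterization are illustrative choices, not a canonical implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])      # assumed reward means for the two arms
theta = np.zeros(2)                    # policy parameters: one logit per action
alpha, baseline = 0.1, 0.0

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for t in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    G = rng.normal(true_means[a], 0.1)          # sampled return for this "episode"
    baseline += 0.01 * (G - baseline)           # running-average baseline
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                       # gradient of log softmax(theta)[a]
    # Move parameters along grad log pi(a) weighted by the baselined return.
    theta += alpha * (G - baseline) * grad_log_pi

print(softmax(theta))                           # should favor the higher-mean arm
```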
Imitation Learning
Imitation learning is a class of reinforcement learning algorithms that learn by observing an expert's actions.
- Behavioral Cloning casts imitation as a supervised learning problem, directly copying expert demonstrations (a sketch follows this list).
- DAgger adds robustness to BC by querying the expert for corrective labels on the states the learned policy actually visits and aggregating them into the training set.
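As a minimal sketch of behavioral cloning, the example below fits a logistic-regression policy to synthetic expert (state, action) pairs; the expert's rule and all data are made up purely to show the supervised-learning framing.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(500, 3))                                         # expert-visited states
expert_actions = (states @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # expert's (assumed) rule

# Supervised learning: minimize cross-entropy between the policy and expert labels.
w, lr = np.zeros(3), 0.1
for epoch in range(200):
    probs = 1.0 / (1.0 + np.exp(-(states @ w)))
    grad = states.T @ (probs - expert_actions) / len(states)   # cross-entropy gradient
    w -= lr * grad

def cloned_policy(state):
    """Act like the expert on seen (and hopefully nearby) states."""
    return int(state @ w > 0)

print(cloned_policy(np.array([1.0, 0.0, 0.0])))
```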