Reinforcement learning deals with general learning problems in which an 🧸 Agent seeks to achieve some goal in an environment formalized as a 🌎 Markov Decision Process (and in rare cases a 🎰 Multi-Armed Bandit or 📖 Contextual Bandit). This system can be broken down into four key components:

  1. A policy defines how the agent behaves: it maps each observation from the environment to an action.
  2. A reward signal defines our goal for the agent by designating which states or actions are desirable. Our agent aims to maximize its expected total reward; that is, for parameters $\theta$ that define our policy $\pi_\theta$, our objective is to maximize $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \gamma^t r_t\right]$ (see the interaction loop sketched after this list).
  3. Value functions estimate the expected total reward from a state or state-action pair, thereby capturing its long-term value.
  4. A model of the environment estimates how the environment behaves. This component is optional, but learning it allows our agent to perform planning.
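
To make these components concrete, here is a minimal sketch of the agent-environment interaction loop on a hypothetical five-state chain (the `step` and `policy` functions below are illustrative placeholders, not any standard API): the policy picks actions, the environment returns rewards, and we accumulate the discounted return that $J(\theta)$ averages over.

```python
import random

def step(state, action):
    """Hypothetical chain environment: move left/right; reaching state 4 ends the episode."""
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == 4 else 0.0   # the reward signal defines the goal
    return next_state, reward, next_state == 4

def policy(observation):
    return random.choice([0, 1])               # a (very naive) policy: act at random

gamma, state, ret, t, done = 0.9, 2, 0.0, 0, False
while not done and t < 100:
    action = policy(state)                     # policy maps observation -> action
    state, reward, done = step(state, action)  # environment responds
    ret += gamma ** t * reward                 # accumulate the discounted return
    t += 1
print(ret)                                     # one sample of the return that J(theta) averages
```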

There's a huge diversity of algorithms that balance stability, efficiency, and prior assumptions. Depending on the environment, specifically the size of the state space, these methods are either tabular or approximate: in the former, we simply maintain a value for each state or state-action pair, whereas in the latter, the state space is combinatorially large, so we generalize across states by using a function (like a 🕸️ Multilayer Perceptron or ✏️ Linear Function Approximation) instead.

Generally, if we have a table or function with parameters $w$, an index $i$ (like a state or state-action pair), and a target $t$, then a tabular update looks like

$$w_i \leftarrow w_i + \alpha \left( t - w_i \right),$$

whereas a function update uses ⛰️ Gradient Descent,

$$w \leftarrow w + \alpha \left( t - \hat{f}_w(i) \right) \nabla_w \hat{f}_w(i).$$
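
As a concrete sketch of these two update rules, assuming a fixed learning rate $\alpha$ and a supplied target (both hypothetical placeholders), the tabular case nudges a stored entry toward the target while the approximate case takes a gradient step on the squared error of a linear model:

```python
import numpy as np

alpha = 0.1  # learning rate (hypothetical choice)

def tabular_update(table, i, target):
    """Tabular rule: move the stored value for index i toward the target."""
    table[i] += alpha * (target - table[i])

def linear_update(w, x_i, target):
    """Gradient-descent rule for a linear approximator f_w(x) = w . x."""
    prediction = w @ x_i
    # negative gradient of 0.5 * (target - prediction)**2 with respect to w
    w += alpha * (target - prediction) * x_i

table = np.zeros(5)
tabular_update(table, i=2, target=1.0)

w, x = np.zeros(3), np.array([1.0, 0.5, -0.2])
linear_update(w, x, target=1.0)
```
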
Nevertheless, whether tabular or approximate, we can generally categorize these methods as learning or planning, depending on whether they're model-free or model-based, respectively.

Planning (Model-Based)

Planning methods use a model of the environment's dynamics, $p(s', r \mid s, a)$, to augment learning or decision-making. Generally, the dynamics can either be used to learn the value functions (background planning) or to make decisions during execution (decision-time planning).

Background Planning

Background planning methods exploit the known transition probabilities to compute accurate expectations over the environment's dynamics.

  1. 🧨 Dynamic Programming is a classic approach to estimating value functions using the environment dynamics and bootstrapping. The two most common implementations are ♻️ Policy Iteration and 💎 Value Iteration (see the sketch after this list).
  2. 🔮 Model Predictive Control and 💣 Dyna are methods that execute both model-learning and planning updates simultaneously.
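
As an illustration of background planning with known dynamics, here is a minimal value iteration sketch on a hypothetical two-state MDP, with $p(s', r \mid s, a)$ stored as `P[s][a] = [(prob, next_state, reward, done), ...]` (a layout similar to Gym's toy-text environments; all numbers are arbitrary):

```python
# Dynamics of a hypothetical two-state MDP (illustrative only).
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)],  1: [(1.0, 1, 0.0, True)]},
}
gamma, tol = 0.9, 1e-8
V = {s: 0.0 for s in P}

def backup(s, a):
    """Expected one-step return for (s, a) under the current value estimates."""
    return sum(p * (r + gamma * (0.0 if done else V[s2])) for p, s2, r, done in P[s][a])

while True:
    delta = 0.0
    for s in P:
        best = max(backup(s, a) for a in P[s])   # Bellman optimality backup
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < tol:
        break

pi = {s: max(P[s], key=lambda a: backup(s, a)) for s in P}   # greedy policy extraction
```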

Decision-Time Planning

Decision-time planning supplements the policy by using the dynamics to choose an action $a$ when given the current state $s$.

  1. 🌲 Heuristic Search recursively expands the game tree and computes action-values using the dynamics probabilities.
  2. 🎳 Rollout methods simulate possible future trajectories under a rollout policy and estimate action-values via sampling (see the sketch after this list).
  3. 🗺️ Monte Carlo Tree Search performs multiple rollouts directed toward trajectories that it estimates to have high value.
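
Below is a minimal sketch of decision-time planning by rollouts on a hypothetical chain model (`simulate` stands in for the known or learned dynamics; the horizon, discount, and rollout count are arbitrary): action-values at the current state are estimated by sampling trajectories under a random rollout policy, and the agent acts greedily on those estimates.

```python
import random

def simulate(state, action):
    """Model of the dynamics for a hypothetical 7-state chain (illustrative only)."""
    next_state = max(0, min(6, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == 6 else -0.01
    return next_state, reward, next_state == 6

def rollout_return(state, gamma=0.95, horizon=30):
    """Sample the discounted return of a random rollout policy from `state`."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        state, reward, done = simulate(state, random.choice([0, 1]))
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

def q_estimate(state, action, n_rollouts=200, gamma=0.95):
    """Average sampled rollouts to estimate Q(state, action) under the model."""
    total = 0.0
    for _ in range(n_rollouts):
        s2, r, done = simulate(state, action)
        total += r + (0.0 if done else gamma * rollout_return(s2, gamma))
    return total / n_rollouts

def plan(state):
    """Decision-time planning: act greedily on the rollout-estimated action-values."""
    return max((0, 1), key=lambda a: q_estimate(state, a))

print(plan(state=3))   # should usually prefer moving right, toward the rewarding state
```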

Learning (Model-Free)

Learning methods derive a policy directly from interaction with the environment, without using the dynamics. These algorithms can generally be grouped into action-value methods, policy gradients, and imitation learning.

Action-Value Methods

Action-value methods estimate the state-value and action-value functions and derive a policy indirectly.

  1. 🪙 Monte Carlo Control is an approximation of dynamic programming using only samples from the environment.
  2. ⌛️ Temporal Difference Learning combines dynamic programming and Monte Carlo control by learning from samples and bootstrapping updates. The most common instances of control are 🧭 Sarsa (on-policy) and 🚀 Q-Learning (off-policy); see the sketch after this list.
  3. 🪜 N-Step Bootstrapping bridges TD learning and Monte Carlo estimates by balancing the weight on samples versus bootstrap estimates.
  4. 🎫 Eligibility Traces generalizes $n$-step bootstrapping to average over all $n$.
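
As a concrete example of TD control, here is a minimal tabular Q-learning sketch on the same hypothetical chain environment used earlier (hyperparameters are arbitrary); swapping the max in the target for the behavior policy's actual next action would turn it into Sarsa.

```python
import random
from collections import defaultdict

def step(state, action):
    """The same hypothetical five-state chain used earlier (illustrative only)."""
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

Q = defaultdict(float)               # Q[(state, action)], implicitly zero-initialized
alpha, gamma, eps = 0.1, 0.9, 0.1
actions = (0, 1)

def epsilon_greedy(state):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

for episode in range(500):
    state, done = 0, False
    while not done:
        action = epsilon_greedy(state)
        next_state, reward, done = step(state, action)
        # Bootstrap from the greedy next action (off-policy); Sarsa would instead
        # bootstrap from the action the epsilon-greedy behavior policy takes next.
        target = reward + (0.0 if done else gamma * max(Q[(next_state, a)] for a in actions))
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state
```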

Policy Gradients

🚓 Policy Gradients optimize the policy directly rather than going through the value functions. They are most commonly used with function approximation techniques since approximation ensures that similar states get similar actions. Some common policy gradient methods are listed below:

  1. 🛠️ Monte Carlo Policy Gradient estimates a baselined return from sampled trajectories (see the sketch after this list).
  2. 🚑 Off-Policy Policy Gradient computes gradients with samples drawn from another (behavior) distribution.
  3. 🎭 Actor-Critic uses an action-value function estimate and updates both the policy and the value function simultaneously.
  4. 🎩 Off-Policy Actor-Critic modifies actor-critic to work with out-of-distribution samples from a replay buffer.
  5. 🩰 A3C applies actor-critic updates asynchronously to stabilize training.
  6. ⚔️ Deterministic Policy Gradients and 🧨 DDPG optimize a deterministic policy with Q-learning-like off-policy updates.
  7. ✌️ TD3 improves DDPG's training stability with more conservative value estimates.
  8. 🪶 Soft Actor-Critic trains an off-policy actor-critic through 🎲 Entropy Regularization.
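
To make the first item concrete, here is a minimal Monte Carlo policy gradient (REINFORCE-style) sketch with a running-average baseline, assuming a tabular softmax policy on the hypothetical chain environment from earlier (all hyperparameters are arbitrary):

```python
import numpy as np

def step(state, action):
    """Hypothetical five-state chain with a small per-step penalty (illustrative only)."""
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == 4 else -0.05
    return next_state, reward, next_state == 4

theta = np.zeros((5, 2))             # one row of action preferences (logits) per state
alpha, gamma, baseline = 0.05, 0.99, 0.0

def sample_action(state):
    logits = theta[state]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(2, p=probs), probs

for episode in range(2000):
    state, done, traj = 0, False, []
    while not done and len(traj) < 500:
        action, probs = sample_action(state)
        next_state, reward, done = step(state, action)
        traj.append((state, action, probs, reward))
        state = next_state
    # Walk the episode backwards, computing returns-to-go and ascending
    # grad log pi(a|s) * (G - baseline) for every visited state.
    G = 0.0
    for state, action, probs, reward in reversed(traj):
        G = reward + gamma * G
        grad_log = -probs
        grad_log[action] += 1.0      # gradient of log softmax w.r.t. the logits
        theta[state] += alpha * (G - baseline) * grad_log
    baseline += 0.1 * (G - baseline) # running-average baseline on the episode return
```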

🚜 Natural Policy Gradient observes that policy updates change the future state-action data distribution and adds a distribution-change constraint to the gradient problem.

  1. 🦁 Trust Region Policy Optimization improves the stability and efficiency of the natural policy gradient with the conjugate gradient method and a line search that checks for improvement.
  2. 📪 Proximal Policy Optimization simplifies TRPO by incorporating the distribution constraint directly into a clipped objective (see the sketch after this list).
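
As a sketch of PPO's central idea, the clipped surrogate objective below compares the new and old policies' probability ratios against advantage estimates (the inputs are hypothetical placeholders; a full implementation would also include value-function and entropy terms):

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective: gains nothing from pushing the ratio far from 1."""
    ratio = np.exp(new_log_probs - old_log_probs)          # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the pessimistic (minimum) of the unclipped and clipped surrogates.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

obj = ppo_clip_objective(
    new_log_probs=np.array([-0.9, -1.1, -0.4]),   # hypothetical per-step log pi_new(a|s)
    old_log_probs=np.array([-1.0, -1.0, -0.5]),   # hypothetical per-step log pi_old(a|s)
    advantages=np.array([1.5, -0.3, 0.7]),        # hypothetical advantage estimates
)
```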

Imitation Learning

Imitation learning is a class of reinforcement learning algorithms that learn by observing an expert's actions.

  1. 🐵 Behavioral Cloning treats imitation as a supervised learning problem, directly fitting the policy to expert demonstrations (see the sketch after this list).
  2. 🗡️ DAgger adds robustness to BC by querying the expert for corrective labels on the states the learned policy actually visits.
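
A minimal behavioral cloning sketch, assuming a handful of hypothetical expert (state, action) pairs and a linear softmax policy fit by cross-entropy, i.e. ordinary supervised learning:

```python
import numpy as np

# Hypothetical expert demonstrations: 2-dimensional state features and action labels.
states  = np.array([[0.1, 1.0], [0.9, 0.2], [0.4, 0.5], [0.8, 0.9]])
actions = np.array([0, 1, 0, 1])
W = np.zeros((2, 2))                                     # features x actions
lr = 0.5

for _ in range(200):
    logits = states @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(2)[actions]
    grad = states.T @ (probs - onehot) / len(actions)    # gradient of the cross-entropy loss
    W -= lr * grad                                       # fit the expert's action choices

policy = lambda s: int(np.argmax(s @ W))                 # imitate: pick the most likely expert action
```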