Reinforcement learning deals with general learning problems in which an Agent seeks to achieve some goal in an environment formalized as a Markov Decision Process (or, in rare cases, a Multi-Armed Bandit or Contextual Bandit). This system can be broken down into four key components (a minimal code sketch of all four follows the list):
- A policy $\pi$ defines how the agent behaves by specifying what the agent should do in response to an observation from the environment.
- A reward signal defines our goal for the agent by designating what states or actions are desirable. Our agent aims to maximize its expected total reward; that is, for some parameters $\theta$ that define our policy $\pi_\theta$, our objective is to maximize $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\sum_t R_t\big]$.
- Value functions define the expected total reward from a state or state-action pair. Thus, these functions capture the long-term value of that state or state-action pair.
- A model of the environment estimates how the environment behaves. This component is optional, but learning it allows our agent to perform planning.
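To make these pieces concrete, here is a minimal sketch that wires all four together on a toy two-state MDP; the environment, its `step` helper, and all constants are assumptions made purely for illustration.

```python
import random

# Toy 2-state MDP; everything here is a hypothetical stand-in for illustration.
STATES, ACTIONS = [0, 1], [0, 1]

def step(state, action):
    """Environment: returns (next_state, reward) for a state-action pair."""
    next_state = action                  # the action deterministically picks the next state
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

def policy(state):
    """Policy: maps an observation to an action (here, uniformly random)."""
    return random.choice(ACTIONS)

value = {s: 0.0 for s in STATES}         # value function: long-term value per state
model = {}                                # optional model: (s, a) -> observed (s', r)

state, total_reward, gamma, alpha = 0, 0.0, 0.9, 0.1
for t in range(100):
    action = policy(state)
    next_state, reward = step(state, action)        # reward signal defines the goal
    model[(state, action)] = (next_state, reward)   # learning a model enables planning
    # TD(0)-style update: nudge the value toward reward + discounted next-state value
    value[state] += alpha * (reward + gamma * value[next_state] - value[state])
    total_reward += reward
    state = next_state

print(total_reward, value)
```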
There's a huge diversity of algorithms that balance stability, efficiency, and prior assumptions. Depending on the environment, specifically the size of the state space, these methods are either tabular or use function approximation; for the former, we simply maintain values for each state or state-action pair, whereas in the latter, due to a combinatorially large state space, we need to generalize our findings by using a function (like a Multilayer Perceptron or Linear Function Approximation) instead.
Generally, if we have a table or function $\hat{v}$, a tabular update nudges a single entry toward a target,
$$\hat{v}(S_t) \leftarrow \hat{v}(S_t) + \alpha \big[ \text{target} - \hat{v}(S_t) \big],$$
whereas a function update uses Gradient Descent on the weights,
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \big[ \text{target} - \hat{v}(S_t, \mathbf{w}) \big] \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}).$$
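As a rough illustration of the two update styles, the sketch below applies the same TD(0)-style target to a table entry and to the weights of a linear approximator; the feature map and all sizes are assumptions, not part of any particular algorithm's specification.

```python
import numpy as np

alpha, gamma = 0.1, 0.9
s, r, s_next = 0, 1.0, 1

# Tabular: one entry per state, nudged directly toward the target.
V = np.zeros(5)
target = r + gamma * V[s_next]
V[s] += alpha * (target - V[s])

# Function approximation: v_hat(s) = w . x(s); a (semi-)gradient step moves the
# weights instead of a table entry, so similar states share what is learned.
def x(state):                          # hypothetical feature map
    feats = np.zeros(3)
    feats[state % 3] = 1.0
    return feats

w = np.zeros(3)
v_hat = lambda state: w @ x(state)
target = r + gamma * v_hat(s_next)
w += alpha * (target - v_hat(s)) * x(s)    # semi-gradient TD(0) step
```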
Nevertheless, whether tabular or approximate, we can generally categorize these methods as learning or planning, depending on whether they're model-free or model-based respectively.
Planning (Model-Based)
Planning methods use a model of the environment's dynamics, $p(s', r \mid s, a)$, to improve the policy or its value estimates.
Background Planning
Background planning methods exploit the known transition probabilities to accurately estimate expectations over the environment's dynamics.
- Dynamic Programming is a classic approach to estimating value functions using the environment dynamics and bootstrapping. The two most common implementations are Policy Iteration and Value Iteration (a value-iteration sketch follows this list).
- Model Predictive Control and Dyna are methods that execute both model-learning and planning updates simultaneously.
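As a concrete example of background planning, here is a minimal value-iteration sketch over a small hand-written dynamics table; the MDP, its `(prob, next_state, reward)` layout, and the constants are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP: P[s][a] lists (prob, next_state, reward) tuples.
gamma, theta = 0.9, 1e-6
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},   # absorbing end state
}

V = np.zeros(len(P))
while True:
    delta = 0.0
    for s in P:
        # Back up each state with the best one-step lookahead over actions.
        q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]]
        best = max(q)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:
        break

# Greedy policy extracted from the converged values.
policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
          for s in P}
print(V, policy)
```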
Decision-Time Planning
Decision-time planning supplements the policy by using the dynamics to choose an action $A_t$ at the current state $S_t$.
- Heuristic Search recursively checks the game tree and computes action-values using the dynamics probabilities.
- Rollout methods simulate possible future trajectories with a rollout policy and estimate action-values via sampling (a sketch follows this list).
- Monte Carlo Tree Search performs multiple rollouts directed toward trajectories that it estimates to have high value.
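To illustrate decision-time planning, the sketch below estimates each action's value at the current state by sampling rollouts from an assumed `simulate` model and then acts greedily; the dynamics and the uniform rollout policy are placeholders, not a prescribed setup.

```python
import random

ACTIONS, gamma = [0, 1], 0.95

def simulate(state, action):
    """Hypothetical model of the dynamics: returns (next_state, reward, done)."""
    next_state = (state + action) % 5
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

def rollout_value(state, action, depth=20, n=50):
    """Average discounted return over n simulated trajectories starting with `action`."""
    total = 0.0
    for _ in range(n):
        s, a, ret, discount = state, action, 0.0, 1.0
        for _ in range(depth):
            s, r, done = simulate(s, a)
            ret += discount * r
            discount *= gamma
            if done:
                break
            a = random.choice(ACTIONS)       # rollout policy: uniform random
        total += ret
    return total / n

current_state = 0
best_action = max(ACTIONS, key=lambda a: rollout_value(current_state, a))
print(best_action)
```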
Learning (Model-Free)
Learning methods derive a policy directly from interacting with the environment without using the dynamics. There are a variety of approaches; these algorithms can generally be grouped into action-value methods, policy gradients, and imitation learning.
Action-Value Methods
Action-value methods estimate the state-value and action-value functions and derive a policy indirectly.
- Monte Carlo Control is an approximation of dynamic programming using only samples from the environment.
- Temporal Difference Learning combines dynamic programming and Monte Carlo control by learning from samples and bootstrapping updates. The most common instances of control are Sarsa (on-policy) and Q-Learning (off-policy); see the Q-learning sketch after this list.
- N-Step Bootstrapping bridges TD learning and Monte Carlo estimates by balancing the weight on samples versus bootstrap estimates.
- Eligibility Traces generalizes $n$-step bootstrapping to average over all $n$.
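As a concrete instance of off-policy TD control, here is a tabular Q-learning sketch on a hypothetical corridor environment; the environment, episode cap, and hyperparameters are assumptions chosen only to keep the example small.

```python
import random
from collections import defaultdict

# Corridor of N states; moving right from the last-but-one state yields reward 1.
N, ACTIONS = 5, [-1, +1]
alpha, gamma, eps = 0.1, 0.95, 0.3
Q = defaultdict(float)                 # Q[(state, action)]

def step(s, a):
    s2 = min(max(s + a, 0), N - 1)
    r = 1.0 if s2 == N - 1 else 0.0
    return s2, r, s2 == N - 1

for episode in range(300):
    s = 0
    for t in range(200):               # cap episode length
        # epsilon-greedy behavior policy
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s2, r, done = step(s, a)
        # off-policy target: bootstrap from the greedy action at the next state
        best_next = 0.0 if done else max(Q[(s2, a_)] for a_ in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2
        if done:
            break

print({s: max(ACTIONS, key=lambda a_: Q[(s, a_)]) for s in range(N)})
```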
Policy Gradients
Policy Gradients optimize the policy directly rather than going through the value functions. These are most commonly used with function approximation, since approximation ensures that similar states get similar actions. Some common policy gradient methods are below:
- Monte Carlo Policy Gradient estimates a baselined return through samples (sketched after this list).
- Off-Policy Policy Gradient computes gradients with samples drawn from another distribution.
- Actor-Critic uses an action-value function estimate and updates both the policy and value function simultaneously.
- Off-Policy Actor-Critic modifies actor-critic to work with off-policy samples from a replay buffer.
- A3C applies actor-critic updates asynchronously across parallel workers to stabilize training.
- Deterministic Policy Gradients and DDPG optimize a deterministic policy for Q-learning-like off-policy updates.
- TD3 improves DDPG training stability with more conservative value estimates.
- Soft Actor-Critic trains an off-policy actor-critic through Entropy Regularization.
- Natural Policy Gradient observes that policy updates change the future state-action data distribution and adds a distribution-change constraint to the gradient problem.
- Trust Region Policy Optimization improves the stability and efficiency of the natural policy gradient using the conjugate gradient method and a line search for improvement.
- Proximal Policy Optimization simplifies TRPO by incorporating the distribution constraint directly into a clipped objective.
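To ground the Monte Carlo policy gradient idea, the sketch below runs a REINFORCE-style update with a running-average baseline on an assumed two-armed bandit; the reward means, learning rates, and softmax parameterization are illustrative choices, not a canonical implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])      # assumed reward means for the two arms
theta = np.zeros(2)                    # policy parameters: one logit per action
alpha, baseline = 0.1, 0.0

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for t in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    G = rng.normal(true_means[a], 0.1)          # sampled return for this "episode"
    baseline += 0.01 * (G - baseline)           # running-average baseline
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0                       # gradient of log softmax(theta)[a]
    # Move parameters along grad log pi(a) weighted by the baselined return.
    theta += alpha * (G - baseline) * grad_log_pi

print(softmax(theta))                           # should favor the higher-mean arm
```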
Imitation Learning
Imitation learning is a class of reinforcement learning algorithms that learn by observing an expert's actions.
- Behavioral Cloning casts imitation as a supervised learning problem, directly copying expert demonstrations (a sketch follows this list).
- DAgger adds robustness to BC by querying the expert for corrective labels on the states the learned policy actually visits and aggregating them into the training set.
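As a minimal sketch of behavioral cloning, the example below fits a logistic-regression policy to synthetic expert (state, action) pairs; the expert's rule and all data are made up purely to show the supervised-learning framing.

```python
import numpy as np

rng = np.random.default_rng(0)
states = rng.normal(size=(500, 3))                                         # expert-visited states
expert_actions = (states @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # expert's (assumed) rule

# Supervised learning: minimize cross-entropy between the policy and expert labels.
w, lr = np.zeros(3), 0.1
for epoch in range(200):
    probs = 1.0 / (1.0 + np.exp(-(states @ w)))
    grad = states.T @ (probs - expert_actions) / len(states)   # cross-entropy gradient
    w -= lr * grad

def cloned_policy(state):
    """Act like the expert on seen (and hopefully nearby) states."""
    return int(state @ w > 0)

print(cloned_policy(np.array([1.0, 0.0, 0.0])))
```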