A reinforcement learning agent aims to maximize total reward by performing actions in an environment, formally a 🌎 Markov Decision Process. To do so, it leverages (one or more) policies, value functions, and models of the world.

Policy

A policy defines how our agent acts in the environment. Formally, it's a mapping from states to the distribution over possible actions,

$$\pi_\theta(a \mid s),$$

where $\theta$ are the parameters defining our policy.

One nuance is that while some environments provide us their full state (for example, in games), other situations like the real world only allow a partial view of the state through sensors, so our policy is conditioned on observations instead. In that case, our environment functions as a ๐Ÿช POMDP that makes a distinction between observations available to the agent and the internal state of the world.
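As a minimal sketch of what such a mapping can look like in practice, here is a softmax policy over discrete actions. The class name, the feature-vector representation of the state, and the dimensions are illustrative assumptions, not part of the original note.

```python
import numpy as np

# Sketch of a parameterized policy pi_theta(a | s) as a softmax over logits.
# The feature representation and initialization scale are assumptions.
class SoftmaxPolicy:
    def __init__(self, n_features: int, n_actions: int, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        # theta maps state features to one logit per action
        self.theta = 0.01 * self.rng.standard_normal((n_features, n_actions))

    def action_probs(self, state_features: np.ndarray) -> np.ndarray:
        """Return the distribution pi_theta(. | s) over actions."""
        logits = state_features @ self.theta
        logits -= logits.max()  # numerical stability for the softmax
        exp = np.exp(logits)
        return exp / exp.sum()

    def sample_action(self, state_features: np.ndarray) -> int:
        """Sample a ~ pi_theta(. | s)."""
        probs = self.action_probs(state_features)
        return int(self.rng.choice(len(probs), p=probs))
```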

Objective

The crucial objective in reinforcement learning is to maximize expected total reward, not just the immediate reward (like a greedy algorithm). We quantify the total reward as our return,

$$R(\tau) = \sum_{t=0}^{T} \gamma^t r_t,$$

where $\gamma \in [0, 1]$ is a discount factor that represents how much we value current reward relative to future rewards. If our horizon is infinite, $T = \infty$ but $\gamma < 1$ (to avoid infinite returns); otherwise, $T$ is a finite value that denotes our termination time.
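To make the return concrete, a small sketch (assuming a finite list of per-step rewards in the notation above):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R(tau) = sum_t gamma^t * r_t for a finite trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: three steps of reward 1.0 with gamma = 0.9
# gives 1.0 + 0.9 + 0.81 = 2.71
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))
```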

For some trajectory $\tau$ defined by our policy $\pi_\theta$,

$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots),$$

the expected return is given by

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right],$$

and our objective is to find the policy that maximizes this value.
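Since the expectation is over trajectories, a simple way to see what $J(\pi_\theta)$ means is to estimate it by Monte Carlo rollouts. The snippet below is a sketch under assumptions not in the original note: a Gymnasium-style environment (`env.reset()` / `env.step()`) and a policy object with a `sample_action` method.

```python
def estimate_expected_return(env, policy, n_episodes=100, gamma=0.99):
    """Monte Carlo estimate of J(pi) = E_tau[R(tau)] by rolling out the policy.

    Assumes a Gymnasium-style environment and a policy with sample_action();
    both are illustrative assumptions.
    """
    total = 0.0
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, t, ret = False, 0, 0.0
        while not done:
            action = policy.sample_action(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            ret += gamma**t * reward          # accumulate discounted reward
            done = terminated or truncated
            t += 1
        total += ret
    return total / n_episodes                # average return over rollouts
```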

Value Functions

By following the policy, our agent will get some return. We can quantify the expected return of policy $\pi$ that starts from state $s$ via the state-value function

$$V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right].$$

Similarly, we can define the expected return of following $\pi$ after taking action $a$ at state $s$ with the action-value function (or Q-function)

$$Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s, a_0 = a \right].$$

Note that the action-value function is related to the state-value function by

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ Q^\pi(s, a) \right].$$

Note that while the Q-function captures the value of taking action $a$ at state $s$, its value combines the value of the state itself with the additional value of the action. To isolate the latter, we have the advantage function

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s),$$

which directly measures the extra reward from taking action $a$ instead of simply following $\pi$.
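As a small numerical illustration of the two identities above (the tabular values are made up for this example):

```python
import numpy as np

# Illustrative tabular example: 2 states, 3 actions.
Q = np.array([[1.0, 2.0, 0.5],
              [0.0, 1.5, 1.0]])      # Q[s, a] = Q^pi(s, a)
pi = np.array([[0.2, 0.5, 0.3],
               [0.6, 0.3, 0.1]])     # pi[s, a] = pi(a | s)

V = (pi * Q).sum(axis=1)             # V^pi(s) = E_{a~pi}[Q^pi(s, a)]
A = Q - V[:, None]                   # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

print(V)  # expected return from each state under pi
print(A)  # extra value of each action relative to pi's average behavior
```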

The state-value and action-value functions can also be expressed recursively via the 🔔 Bellman Equation. This equality is the basis of many reinforcement learning techniques.
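For reference, a standard form of this recursion in the notation above (the linked note may use a slightly different convention) is

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^\pi(s') \right],$$

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[ r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\big[ Q^\pi(s', a') \big] \right].$$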

Occupancy Measure

The occupancy measure is the stationary distribution over state-action pairs induced by running our policy in the environment. Formally,

$$\rho^\pi(s, a) = \pi(a \mid s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi),$$

where $P(s_t = s \mid \pi)$ is dependent on both the environment dynamics and the policy. Note that we can conversely define our policy in terms of the occupancy measure,

$$\pi(a \mid s) = \frac{\rho^\pi(s, a)}{\sum_{a'} \rho^\pi(s, a')},$$

meaning that the policy and occupancy measure share a one-to-one correspondence.
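A sketch of what this looks like computationally: estimate the discounted occupancy measure from rollouts, then recover the policy by normalizing over actions. The integer-indexed tabular states/actions and Gymnasium-style environment are assumptions for illustration.

```python
import numpy as np

def estimate_occupancy(env, policy, n_states, n_actions,
                       n_episodes=1000, gamma=0.99, max_steps=200):
    """Monte Carlo estimate of the discounted occupancy measure rho^pi(s, a).

    Assumes integer-indexed states/actions and a Gymnasium-style env;
    these are illustrative assumptions, not part of the original note.
    """
    rho = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s, _ = env.reset()
        for t in range(max_steps):
            a = policy.sample_action(s)
            rho[s, a] += gamma**t            # discounted visitation weight
            s, _, terminated, truncated, _ = env.step(a)
            if terminated or truncated:
                break
    rho /= n_episodes

    # Recover the policy from the occupancy measure: pi(a|s) = rho(s,a) / sum_a' rho(s,a')
    denom = rho.sum(axis=1, keepdims=True)
    pi = np.divide(rho, denom, out=np.zeros_like(rho), where=denom > 0)
    return rho, pi
```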

If we just want the stationary distribution of states, we can marginalize the occupancy measure over actions to get the on-policy distribution