A reinforcement learning agent aims to maximize total reward by performing actions in an environment, formally a Markov Decision Process. To do so, it leverages (one or more) policies, value functions, and models of the world.
Policy
A policy defines how our agent acts in the environment. Formally, it's a mapping from states to the distribution over possible actions,

$$\pi(a \mid s) = P(A_t = a \mid S_t = s),$$

where $s \in \mathcal{S}$ is a state and $a \in \mathcal{A}$ is an action.

One nuance is that while some environments provide us their full state (for example, in games), other situations like the real world only allow a partial view of the state through sensors, so our policy $\pi(a \mid o)$ is conditioned on observations $o$ rather than the underlying state.
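To make this concrete, here is a minimal sketch of a tabular stochastic policy: a lookup table from states to action distributions, from which we sample actions. The states, action count, and probabilities below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tabular policy: each state maps to a distribution over 3 actions.
policy = {
    "s0": np.array([0.7, 0.2, 0.1]),  # pi(a | s0)
    "s1": np.array([0.1, 0.1, 0.8]),  # pi(a | s1)
}

def sample_action(state):
    """Draw an action a ~ pi(. | state)."""
    probs = policy[state]
    return rng.choice(len(probs), p=probs)

print(sample_action("s0"))  # most often 0, occasionally 1 or 2
```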
Objective
The crucial objective in reinforcement learning is to maximize expected total reward, not just the immediate reward (as a greedy algorithm would). We quantify the total reward as our return,

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$

where $\gamma \in [0, 1]$ is the discount factor that weights near-term rewards more heavily than distant ones.

For some trajectory $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \dots)$ sampled by running the policy $\pi$ in the environment, the expected return is given by

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right],$$

and our objective is to find the policy $\pi^* = \arg\max_{\pi} J(\pi)$ that maximizes this value.
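As a sketch of how these quantities show up in practice: the return of a single trajectory is a discounted sum of its rewards, and $J(\pi)$ can be approximated by a Monte Carlo average over many trajectories. The `run_episode` callback below is a hypothetical helper, assumed to roll out the policy once and return its reward sequence.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """G_0 = sum_k gamma^k * r_{k+1} for one trajectory's reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

def estimate_objective(run_episode, num_episodes=1000, gamma=0.99):
    """Monte Carlo estimate of J(pi): average the discounted return over
    trajectories sampled by repeatedly running the policy."""
    returns = [discounted_return(run_episode(), gamma) for _ in range(num_episodes)]
    return float(np.mean(returns))
```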
Value Functions
By following the policy, our agent will get some return. We can quantify the expected return of policy $\pi$ starting from a state $s$ with the state-value function

$$V^\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right].$$

Similarly, we can define the expected return of following $\pi$ after first taking action $a$ in state $s$ with the action-value function (Q-function)

$$Q^\pi(s, a) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right].$$

Note that the action-value function is related to the state-value function by

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[Q^\pi(s, a)\right] = \sum_a \pi(a \mid s)\, Q^\pi(s, a).$$

Note that while the Q-function captures the value of an action $a$ in absolute terms, we can also define the advantage function

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s),$$

that directly measures the extra reward from taking action $a$ over simply following the policy in state $s$.
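These identities are easy to check numerically. Here is a small sketch with made-up Q-values and policy probabilities for a single state:

```python
import numpy as np

q = np.array([1.0, 2.0, 0.5])    # hypothetical Q^pi(s, a) for actions 0, 1, 2
pi = np.array([0.2, 0.5, 0.3])   # hypothetical pi(a | s)

v = pi @ q                       # V^pi(s) = sum_a pi(a|s) Q^pi(s, a)
advantage = q - v                # A^pi(s, a) = Q^pi(s, a) - V^pi(s)

print(v)          # 1.35
print(advantage)  # [-0.35  0.65 -0.85]
```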
The state-value and action-value functions can also be expressed recursively via the Bellman Equation. This equality is the basis of many reinforcement learning techniques.
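For instance, iterative policy evaluation repeatedly applies the Bellman expectation backup until the value estimates stop changing. The sketch below assumes a small tabular MDP given as transition probabilities `P[s, a, s']`, rewards `R[s, a, s']`, and a policy `pi[s, a]`; these array names are illustrative, not from any particular library.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Estimate V^pi by repeatedly applying the Bellman expectation backup:
        V(s) <- sum_a pi(a|s) sum_s' P[s,a,s'] * (R[s,a,s'] + gamma * V(s'))."""
    V = np.zeros(P.shape[0])
    while True:
        # Q[s, a]: expected one-step reward plus discounted value of the next state.
        Q = np.einsum("sat,sat->sa", P, R) + gamma * (P @ V)
        V_new = np.sum(pi * Q, axis=1)      # average over the policy's action choice
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```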
Occupancy Measure
The occupancy measure $\rho^\pi$ of a policy is the discounted distribution over state-action pairs that the agent visits while following $\pi$,

$$\rho^\pi(s, a) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, a_t = a \mid \pi),$$

where the policy can be recovered by normalizing over actions,

$$\pi(a \mid s) = \frac{\rho^\pi(s, a)}{\sum_{a'} \rho^\pi(s, a')},$$

meaning that the policy and occupancy measure share a one-to-one correspondence.

If we just want the stationary distribution of states, we can marginalize the occupancy measure over actions to get the on-policy distribution

$$d^\pi(s) \propto \sum_a \rho^\pi(s, a).$$
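A small sketch of both directions of this correspondence for a tabular MDP, reusing the hypothetical `P` and `pi` arrays from the policy evaluation sketch plus an assumed initial state distribution `mu0`: compute the discounted occupancy measure by solving the induced linear system, then recover the policy and the state distribution from it.

```python
import numpy as np

def occupancy_measure(P, pi, mu0, gamma=0.9):
    """rho(s, a) = sum_t gamma^t P(s_t = s, a_t = a) for a tabular MDP."""
    # State-to-state dynamics induced by the policy: P_pi[s, s'] = sum_a pi(a|s) P[s, a, s'].
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Discounted state visitation d satisfies d = mu0 + gamma * P_pi^T d.
    d = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi.T, mu0)
    return d[:, None] * pi               # rho(s, a) = d(s) * pi(a | s)

def recover_policy(rho):
    """pi(a | s) = rho(s, a) / sum_a' rho(s, a')."""
    return rho / rho.sum(axis=1, keepdims=True)

def state_distribution(rho):
    """On-policy state distribution: marginalize rho over actions and normalize."""
    d = rho.sum(axis=1)
    return d / d.sum()
```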