Monte Carlo methods for reinforcement learning can be used to approximate the exact 🧨 Dynamic Programming algorithms without using environment dynamics. Without the distribution $p(s', r \mid s, a)$, we can perform estimates via sampling instead. Our updates are thus performed episode-by-episode rather than at every time step.

Evaluation

The Monte Carlo way to estimate the state-value function $V(s)$ is by averaging the returns experienced after visits to each state $s$,

$$V(s) = \frac{\sum_t \mathbb{1}[S_t = s]\, G_t}{\sum_t \mathbb{1}[S_t = s]},$$

where $\mathbb{1}$ is an indicator function. There are two approaches to defining this indicator and thus how we count these returns:

  1. First-visit MC prediction only uses the return from the first visit to $s$ in each episode. This avoids biased estimates, making it the more theoretically studied variant.
  2. Every-visit MC prediction tracks the returns from every visit to $s$.
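
As a concrete reference, here is a minimal Python sketch of first-visit MC prediction. It assumes a hypothetical `sample_episode(policy)` helper that runs one full episode and returns a list of `(state, reward)` pairs, where each reward is the one received after leaving that state; it is a sketch of the idea, not a definitive implementation.

```python
from collections import defaultdict

def first_visit_mc_prediction(sample_episode, policy, num_episodes, gamma=0.99):
    """Estimate V(s) by averaging first-visit returns across episodes."""
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = sample_episode(policy)   # assumed: [(S_0, R_1), (S_1, R_2), ...]
        G = 0.0
        first_returns = {}                 # state -> return from its first visit
        # Walk backwards so G accumulates the discounted return from each step;
        # earlier visits overwrite later ones, leaving the first-visit return.
        for state, reward in reversed(episode):
            G = gamma * G + reward
            first_returns[state] = G
        for state, G_first in first_returns.items():
            returns_sum[state] += G_first
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V
```

The every-visit variant would instead accumulate the return for every occurrence of a state in the inner loop rather than keeping only the first one.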

Control

Since we don’t have the environment dynamics, our policy improvement step requires knowing the value of each action, not just the value of each state. Thus, in a full Monte Carlo control algorithm, we perform prediction for the Q-function instead, tracking averages across state-action pairs rather than just states:

$$Q(s, a) = \frac{\sum_t \mathbb{1}[S_t = s, A_t = a]\, G_t}{\sum_t \mathbb{1}[S_t = s, A_t = a]}$$

However, if our policy is deterministic, some state-action pairs will never be visited, so their values can never be estimated. This introduces the exploration-exploitation tradeoff, which can be handled with either an on-policy or an off-policy method.

On-Policy

An on-policy Monte Carlo control method uses the same policy to estimate the value function and for improvement. To estimate our Q-function, we need to use a stochastic “soft” policy instead, for example 💰 Epsilon-Greedy, which allows us to sample all state-action pairs.

Our policy improvement guarantees still hold, but our final optimal policy won’t be greedy. Rather, this optimality can be thought of as a standard greedy deterministic policy acting in an environment that has a chance of ignoring our chosen action and picking a random one instead. Thus, we end up with the best soft policy, but not the best deterministic one.
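
Below is a minimal sketch of on-policy first-visit MC control with an ε-greedy soft policy. It again assumes a hypothetical `sample_episode(policy)` helper, here returning `(state, action, reward)` triples, and a finite `actions` list.

```python
import random
from collections import defaultdict

def on_policy_mc_control(sample_episode, actions, num_episodes,
                         gamma=0.99, epsilon=0.1):
    """On-policy first-visit MC control with an epsilon-greedy soft policy."""
    Q = defaultdict(float)             # Q[(state, action)]
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    def policy(state):
        # Soft policy: explore with probability epsilon, otherwise act greedily on Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        episode = sample_episode(policy)   # assumed: [(S_0, A_0, R_1), (S_1, A_1, R_2), ...]
        G = 0.0
        first_returns = {}                 # (state, action) -> first-visit return
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            first_returns[(state, action)] = G
        for pair, G_first in first_returns.items():
            returns_sum[pair] += G_first
            returns_count[pair] += 1
            Q[pair] = returns_sum[pair] / returns_count[pair]
    return Q, policy
```

Because the policy reads Q through a closure, the improvement step happens implicitly after every episode as the averages change.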

Off-Policy

To train a deterministic policy, we have off-policy Monte Carlo, which generates estimates with one policy while improving a separate one. The former is the behavior policy $b$, and the latter is the target policy $\pi$.

$b$ can be any stochastic policy that visits all state-action pairs, and $\pi$ can be deterministic. The core problem with off-policy methods is that our value estimates must be based on how $\pi$ would act, but our samples were generated from $b$.

To address this, we evaluate the expectation with 🪆 Importance Sampling, which allows us to swap the sampling distribution by adding an importance weight inside the expectation. Specifically, the weight for a single transition is

$$\frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}.$$

For a return $G_t$ at time step $t$, we would apply this weight to every transition from $t$ onward in the trajectory, giving the cumulative weight

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},$$

so the definition for the state-value is

$$V_\pi(s) = \mathbb{E}_b\left[\rho_{t:T-1}\, G_t \mid S_t = s\right],$$

and similarly for the action-value we have

$$Q_\pi(s, a) = \mathbb{E}_b\left[\rho_{t+1:T-1}\, G_t \mid S_t = s, A_t = a\right],$$

where the product starts at $t+1$ because the first action is already given by the conditioning.

With ordinary importance sampling, we would estimate these as the average of the weighted returns calculated via first-visit,

$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|},$$

where $\mathcal{T}(s)$ is the set of first-visit time steps for $s$ and $T(t)$ is the termination time of the episode containing $t$.

Another alternative estimate is weighted importance sampling, which divides the sum of weighted returns by the sum of their weights rather than by the number of samples,

$$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}.$$

This results in an estimate that has higher bias but lower variance.
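
The sketch below puts these pieces together for off-policy MC prediction of the Q-function, supporting both ordinary and weighted importance sampling. For brevity it uses the every-visit form, and it assumes hypothetical helpers: `sample_episode()` runs the behavior policy, `behavior_prob(a, s)` returns $b(a \mid s)$, and `target_prob(a, s)` returns $\pi(a \mid s)$.

```python
from collections import defaultdict

def off_policy_mc_prediction(sample_episode, behavior_prob, target_prob,
                             num_episodes, gamma=0.99, weighted=True):
    """Off-policy MC estimate of Q under the target policy pi from behavior samples."""
    Q = defaultdict(float)
    numer = defaultdict(float)   # running sum of rho * G per (state, action)
    denom = defaultdict(float)   # running sum of rho (weighted) or sample count (ordinary)

    for _ in range(num_episodes):
        episode = sample_episode()   # assumed: [(S_0, A_0, R_1), ...] generated by b
        G = 0.0
        rho = 1.0                    # product of pi/b ratios over steps t+1 .. T-1
        # Walk backwards: at step t, G is the return G_t and rho is rho_{t+1:T-1},
        # since the action A_t is given when estimating Q(S_t, A_t).
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            numer[(state, action)] += rho * G
            denom[(state, action)] += rho if weighted else 1.0
            if denom[(state, action)] > 0:
                Q[(state, action)] = numer[(state, action)] / denom[(state, action)]
            # Fold this step's ratio into rho so it applies to earlier time steps.
            rho *= target_prob(action, state) / behavior_prob(action, state)
    return Q
```

Setting `weighted=True` gives the higher-bias, lower-variance estimator described above; `weighted=False` recovers ordinary importance sampling.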

Policy Improvement

Similar to ♻️ Policy Iteration, once we estimate our Q-function, we can generate a better deterministic policy by acting greedily with respect to it,

$$\pi(s) = \arg\max_a Q(s, a).$$
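
For completeness, a tiny sketch of this greedy improvement step over a tabular Q like the ones produced above (the `actions` list is again an assumed finite action set):

```python
def greedy_policy(Q, actions):
    """Deterministic improvement: in each state, pick the action with the highest Q."""
    def policy(state):
        return max(actions, key=lambda a: Q[(state, a)])
    return policy
```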