Monte Carlo methods for reinforcement learning can be used to approximate the exact Dynamic Programming algorithms without using environment dynamics. Without the distribution $p(s', r \mid s, a)$, we can't compute expectations exactly, so we estimate them by averaging over sampled episodes of experience.
Evaluation
The Monte Carlo way to estimate the state-value function $v_\pi(s)$ is to average the returns observed after visits to $s$:

$$V(s) = \frac{1}{N(s)} \sum_{i=1}^{N(s)} G_i,$$

where $G_i$ is the return following the $i$-th counted visit to $s$ and $N(s)$ is the number of counted visits. Which visits count depends on the variant; a code sketch follows the list.

- First-visit MC prediction only uses the return of the first visit to $s$ in each episode. This keeps the estimate unbiased, making it the more theoretically studied variant.
- Every-visit MC prediction tracks the returns of every visit to $s$.
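As a concrete sketch of first-visit MC prediction (assuming a gym-style environment with `reset()` and `step(action)` and a `policy(state)` function, none of which appear in the original notes):

```python
from collections import defaultdict

def first_visit_mc_prediction(env, policy, num_episodes, gamma=0.99):
    """Estimate V(s) for a fixed policy by averaging first-visit returns.

    Assumes env.reset() -> state, env.step(action) -> (next_state, reward, done),
    and policy(state) -> action.
    """
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)

    for _ in range(num_episodes):
        # Generate one episode by following the policy.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Record the time step of each state's first visit.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        # Walk backwards, accumulating the discounted return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:                # only the first visit counts
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]

    return V
```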
Control
Since we don't have the environment dynamics, our policy improvement step requires knowing the value of each action, not just the value of each state. Thus, in a full Monte Carlo control algorithm, we perform prediction for the Q-function $q_\pi(s, a)$ rather than for $v_\pi(s)$.
However, if our policy is deterministic, many state-action pairs will never be visited, so we need to maintain exploration to estimate their values.
On-Policy
An on-policy Monte Carlo control method uses the same policy $\pi$ to both generate episodes and be improved. To maintain exploration, $\pi$ is kept $\epsilon$-soft (e.g., $\epsilon$-greedy), so every action retains a nonzero probability of being selected.
Our policy improvement guarantees still hold, but our final optimal policy won't be greedy. Rather, this optimality can be thought of as a standard greedy deterministic policy acting in an environment that has a small built-in probability $\epsilon$ of ignoring the chosen action and acting randomly.
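A minimal sketch of on-policy first-visit MC control with an $\epsilon$-greedy policy, under the same assumed `reset()`/`step()` interface as above; `num_actions` and the helper names are illustrative assumptions:

```python
import random
from collections import defaultdict

def on_policy_mc_control(env, num_actions, num_episodes, gamma=0.99, epsilon=0.1):
    """On-policy first-visit MC control with an epsilon-greedy policy."""
    Q = defaultdict(lambda: [0.0] * num_actions)
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.randrange(num_actions)                        # explore
        return max(range(num_actions), key=lambda a: Q[state][a])       # exploit

    for _ in range(num_episodes):
        # Generate an episode with the current epsilon-greedy policy.
        episode = []
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # First-visit returns for each (state, action) pair.
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[s][a] = returns_sum[(s, a)] / returns_count[(s, a)]

    return Q
```

Because the $\epsilon$-greedy policy always reads the latest Q-values, the improvement step happens implicitly after every episode.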
Off-Policy
To train a deterministic policy, we have off-policy Monte Carlo, which generates episodes with a policy separate from the one we're improving. The former is the behavior policy $b$; the latter is the target policy $\pi$. Because the returns are sampled under $b$, they follow the wrong distribution for evaluating $\pi$.
To address this, we evaluate the expectation with Importance Sampling, which allows us to swap the sampling distribution by adding an importance weight inside the expectation. Specifically, the weight for a single transition is

$$\frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}.$$

For a return $G_t$ at time step $t$ in an episode terminating at time $T$, the weight is the product of the per-transition ratios,

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}, \qquad v_\pi(s) = \mathbb{E}_b\left[\rho_{t:T-1} G_t \mid S_t = s\right],$$

and similarly for the action-value we have

$$q_\pi(s, a) = \mathbb{E}_b\left[\rho_{t+1:T-1} G_t \mid S_t = s, A_t = a\right].$$
With ordinary importance sampling, we estimate these as the average of the weighted returns collected via first-visit, dividing by the number of samples.
Another alternative is weighted importance sampling, which divides the sum of weighted returns by the sum of the weights rather than by the number of samples. This results in an estimate that has higher bias but lower variance.
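The two estimators differ only in their denominator. A sketch of off-policy first-visit evaluation that computes both, assuming helper functions `target_prob(s, a)` and `behavior_prob(s, a)` that return $\pi(a \mid s)$ and $b(a \mid s)$:

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, target_prob, behavior_prob, gamma=0.99):
    """Estimate V_pi(s) from episodes generated by a behavior policy b.

    episodes: list of episodes, each a list of (state, action, reward).
    Returns (ordinary_IS_estimate, weighted_IS_estimate) dicts keyed by state.
    """
    weighted_returns = defaultdict(float)  # sum of rho * G over first visits
    weight_totals = defaultdict(float)     # sum of rho over first visits
    visit_counts = defaultdict(int)        # number of first visits

    for episode in episodes:
        first_visit = {}
        for t, (s, _, _) in enumerate(episode):
            first_visit.setdefault(s, t)

        # Walk backwards so both G and the importance ratio rho accumulate
        # over the remainder of the episode from each time step t.
        G, rho = 0.0, 1.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            rho *= target_prob(s, a) / behavior_prob(s, a)  # per-transition ratio
            if first_visit[s] == t:
                weighted_returns[s] += rho * G
                weight_totals[s] += rho
                visit_counts[s] += 1

    ordinary = {s: weighted_returns[s] / visit_counts[s] for s in visit_counts}
    weighted = {s: weighted_returns[s] / weight_totals[s]
                for s in visit_counts if weight_totals[s] > 0}
    return ordinary, weighted
```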
Policy Improvement
Similar to Policy Iteration, once we estimate our Q-function, we can generate a better deterministic policy by acting greedily with respect to it:

$$\pi'(s) = \arg\max_a Q(s, a).$$
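A tiny sketch of that improvement step, assuming Q is the dict-of-lists structure from the control example above:

```python
def greedy_policy(Q):
    """Extract a deterministic policy: pi(s) = argmax_a Q[s][a]."""
    return {s: max(range(len(q_values)), key=q_values.__getitem__)
            for s, q_values in Q.items()}
```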