Policy iteration is a reinforcement learning approach that iteratively performs policy evaluation and policy improvement.
- Iterative Policy Evaluation computes the value function $V^\pi$ for our policy $\pi$.
- Policy Improvement learns a better policy using the greedy update $\pi'(s) = \arg\max_a Q^\pi(s, a)$.
In tabular settings, by repeating these two steps, policy iteration is guaranteed to converge to the optimal policy. The first step gives an accurate estimate of the current policy's value, and the second step produces a policy that is at least as good given those values.
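To make the two steps concrete, here is a minimal tabular sketch in NumPy. It assumes the dynamics are given as dense arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected rewards); these array names, the discount factor, and the stopping tolerances are illustrative choices, not anything prescribed above.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.99, tol=1e-8):
    """Iterative policy evaluation: apply the Bellman backup for a fixed policy until V stops changing."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # V(s) <- R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')
        V_new = R[np.arange(n_states), policy] + gamma * (P[np.arange(n_states), policy] @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(P, R, V, gamma=0.99):
    """Act greedily with respect to the one-step lookahead Q computed from V and the dynamics."""
    # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    Q = R + gamma * np.einsum("san,n->sa", P, V)
    return Q.argmax(axis=1)

def policy_iteration(P, R, gamma=0.99):
    """Alternate evaluation and improvement until the greedy policy stops changing."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)               # start from an arbitrary policy
    while True:
        V = policy_evaluation(P, R, policy, gamma)        # step 1: evaluate the current policy
        new_policy = policy_improvement(P, R, V, gamma)   # step 2: greedy improvement
        if np.array_equal(new_policy, policy):            # fixed point: policy is optimal (tabular case)
            return policy, V
        policy = new_policy
```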
Policy Improvement
Policy improvement is based on the observation that for two deterministic policies $\pi$ and $\pi'$, if

$$Q^\pi(s, \pi'(s)) \ge V^\pi(s) \quad \text{for all } s,$$

then

$$V^{\pi'}(s) \ge V^\pi(s) \quad \text{for all } s.$$
Given our current policy $\pi$, we therefore act greedily with respect to its Q-function:

$$\pi'(s) = \arg\max_a Q^\pi(s, a),$$

which satisfies the condition above by construction.
Note that this is also equivalent to

$$\pi'(s) = \arg\max_a A^\pi(s, a), \qquad A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s),$$

since subtracting the action-independent baseline $V^\pi(s)$ does not change which action attains the maximum.
Both the Q-function and the advantage can be computed from our state-value function and the environment dynamics:

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[r(s, a, s') + \gamma V^\pi(s')\right], \qquad A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).$$
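As a small sketch of this step (reusing the assumed `P`, `R`, `V` arrays and discount factor from the earlier example), the greedy action is the same whether we maximize the Q-function or the advantage, because $V^\pi(s)$ is a per-state constant:

```python
import numpy as np

def greedy_from_values(P, R, V, gamma=0.99):
    """Greedy policy from V and the dynamics; argmax over Q and over A coincide."""
    # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    Q = R + gamma * np.einsum("san,n->sa", P, V)
    A = Q - V[:, None]          # advantage: subtract the action-independent baseline V(s)
    assert np.array_equal(Q.argmax(axis=1), A.argmax(axis=1))  # same greedy policy either way
    return Q.argmax(axis=1)
```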
Alternative Formulation
Noting that we’re using the Q-function anyway in the policy improvement step, we can alternatively evaluate $Q^\pi$ directly:

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[r(s, a, s') + \gamma\, Q^\pi(s', \pi(s'))\right].$$
Crucially, the difference between this update and the one for the value function is that our action $a$ is a free input to the backup rather than being chosen by the current policy; only the action at the next state $s'$ is selected by $\pi$.
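A corresponding sketch of this Q-based evaluation, under the same assumed `P[s, a, s']` and `R[s, a]` arrays: the table is updated for every state–action pair, and the policy only enters through the action taken at the next state.

```python
import numpy as np

def q_policy_evaluation(P, R, policy, gamma=0.99, tol=1e-8):
    """Evaluate Q^pi directly: Q(s, a) <- R(s, a) + gamma * sum_s' P(s'|s, a) Q(s', pi(s'))."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        next_values = Q[np.arange(n_states), policy]             # V^pi(s') = Q^pi(s', pi(s'))
        Q_new = R + gamma * np.einsum("san,n->sa", P, next_values)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```

With $Q^\pi$ in hand, the improvement step no longer needs the dynamics at all: it reduces to `Q.argmax(axis=1)`.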