Policy iteration is a reinforcement learning approach that iteratively performs policy evaluation and policy improvement.

  1. Iterative Policy Evaluation computes the value function $V^\pi$ for our policy $\pi$.
  2. Policy improvement learns a better policy $\pi'$ using $V^\pi$.

In tabular settings, repeating these two steps is guaranteed to converge to the optimal policy: the first step gives an accurate estimate of the current policy's value function, and the second step always produces a policy that is at least as good given those values.
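As a rough sketch of this loop in the tabular case (the dynamics tensor `P`, reward table `R`, discount `gamma`, and tolerance `tol` are illustrative assumptions, not defined above):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular policy iteration.

    P: transition probabilities, shape (S, A, S)
    R: expected immediate rewards, shape (S, A)
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)            # arbitrary initial deterministic policy
    while True:
        # 1. Iterative policy evaluation: sweep V until the Bellman update converges.
        V = np.zeros(S)
        while True:
            V_new = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # 2. Policy improvement: act greedily w.r.t. Q computed from V and the dynamics.
        Q = R + gamma * (P @ V)            # Q(s, a) = R(s, a) + gamma * E[V(s')]
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):     # greedy policy unchanged -> stop
            return pi, V
        pi = pi_new
```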

Policy Improvement

Policy improvement is based on the observation that for two deterministic policies $\pi$ and $\pi'$, if

$$Q^\pi\big(s, \pi'(s)\big) \ge V^\pi(s) \quad \text{for all } s,$$

then $\pi'$ is at least as good a policy as $\pi$. That is, if taking the action chosen by $\pi'$ and following $\pi$ afterwards is at least as good as following $\pi$ directly, then $\pi'$ is at least as good as $\pi$.

Given our current policy $\pi$ and the state-value function $V^\pi$, we can thus find a better policy via

$$\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\!\left[ r(s, a) + \gamma\, V^\pi(s') \right].$$

Note that this is also equivalent to

$$\pi'(s) = \arg\max_a A^\pi(s, a), \qquad A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s),$$

since $V^\pi(s)$ does not depend on the action and therefore doesn't change the argmax. Both the Q-function and the advantage can be computed from our state-value function $V^\pi$ and the environment dynamics.
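As a small sketch of just this step, assuming the same tabular `P` and `R` arrays as in the sketch above: compute $Q^\pi$ from $V^\pi$ and the dynamics, form the advantage, and act greedily.

```python
import numpy as np

def improved_policy(V, P, R, gamma=0.99):
    """Greedy improvement from a state-value function V.

    Q(s, a) = R(s, a) + gamma * E[V(s')] follows from V and the dynamics.
    The advantage subtracts V(s), which is the same for every action in s,
    so argmax_a Q(s, a) and argmax_a A(s, a) select the same actions.
    """
    Q = R + gamma * (P @ V)        # shape (S, A)
    advantage = Q - V[:, None]     # A(s, a) = Q(s, a) - V(s)
    return Q.argmax(axis=1)        # identical to advantage.argmax(axis=1)
```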

Alternative Formulation

Noting that we’re using the Q-function anyway in the policy improvement step, we can alternatively evaluate $Q^\pi$ directly. Then,

$$Q^\pi(s, a) \leftarrow \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\!\left[ r(s, a) + \gamma\, Q^\pi\big(s', \pi(s')\big) \right].$$

Crucially, the difference between this update and the one for the value function is that the action $a$ doesn’t have to come from our current policy. Thus, another way to estimate $Q^\pi$ without the transition dynamics is to fit it using $(s, a, r, s')$ tuples generated from any policy.
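Here is a minimal sketch of that idea, assuming tabular states and actions; the `(s, a, r, s_next)` tuple format, learning rate `alpha`, and pass count are illustrative choices rather than anything prescribed above.

```python
import numpy as np

def fit_q_from_tuples(transitions, pi, num_states, num_actions,
                      gamma=0.99, alpha=0.1, num_passes=50):
    """Estimate Q^pi without access to the transition dynamics.

    transitions: list of (s, a, r, s_next) tuples generated by ANY behavior policy.
    The bootstrap target uses the action pi would take in s_next, so the
    estimate is still for the policy pi we want to evaluate.
    """
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_passes):
        for s, a, r, s_next in transitions:
            target = r + gamma * Q[s_next, pi[s_next]]
            Q[s, a] += alpha * (target - Q[s, a])
    return Q
```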