Value iteration is a reinforcement learning algorithm that estimates the Q-function and value function using the environment's dynamics. The algorithm performs the following two steps for every state $s$:

  1. Set $Q(s, a) \leftarrow r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}[V(s')]$
  1. Set $V(s) \leftarrow \max_a Q(s, a)$

An alternative formulation merges the two steps, repeating a single update instead:

$$V(s) \leftarrow \max_a \left( r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}[V(s')] \right)$$

This method implicitly finds a deterministic optimal policy, since the second step assumes that our policy takes the best action following $\pi(s) = \arg\max_a Q(s, a)$.
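To make this concrete, here is a minimal sketch of the merged update for a tabular MDP, assuming the dynamics and rewards are given as arrays `P[s, a, s']` and `R[s, a]` (hypothetical names, not from the original note):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, iters=1000, tol=1e-8):
    """Tabular value iteration via the merged update
    V(s) <- max_a [ R(s,a) + gamma * E_{s'}[V(s')] ].

    P: transition probabilities, shape (S, A, S)
    R: rewards, shape (S, A)
    """
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        # Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
        Q = R + gamma * P @ V            # shape (S, A)
        V_new = Q.max(axis=1)            # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)            # implicit deterministic policy
    return V, policy
```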

We can interpret value iteration as a simpler version of ♻️ Policy Iteration where we merge the policy improvement step into policy evaluation. It also has the same convergence guarantee in the tabular case; the intuition is that each iteration contracts the distance between our current $V$ and the optimal $V^*$, which ensures convergence.
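Concretely, the standard argument (sketched here for completeness) is that the Bellman backup $\mathcal{B}$ is a $\gamma$-contraction in the $\infty$-norm and $V^*$ is its fixed point, so every backup shrinks the worst-case gap:

$$\lVert \mathcal{B}V - \mathcal{B}\bar{V} \rVert_\infty \le \gamma \, \lVert V - \bar{V} \rVert_\infty \quad \Rightarrow \quad \lVert \mathcal{B}V - V^* \rVert_\infty \le \gamma \, \lVert V - V^* \rVert_\infty$$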

Fitted Value Iteration

While the above assumes that $V$ is a table with discrete states $s$, it's often not possible to use this tabular method when the state space is huge. Fitted value iteration replaces the table with a neural network $V_\phi(s)$, and our algorithm must learn to fit the network in the second step:

  1. Set $y_i \leftarrow \max_{a_i} \left( r(s_i, a_i) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s_i, a_i)}[V_\phi(s')] \right)$
  1. Set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \lVert V_\phi(s_i) - y_i \rVert^2$
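As an illustrative sketch (not from the original note), the two steps above might look like the following, assuming PyTorch, known dynamics `P` and rewards `R`, and one-hot state features fed to a small value network (all names hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical setup: S discrete states as one-hot features,
# known dynamics P (S, A, S) and rewards R (S, A).
S, A, gamma = 50, 4, 0.99
P = torch.rand(S, A, S)
P = P / P.sum(-1, keepdim=True)
R = torch.rand(S, A)
states = torch.eye(S)                        # one-hot features for every state

V_phi = nn.Sequential(nn.Linear(S, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(V_phi.parameters(), lr=1e-3)

for it in range(200):
    with torch.no_grad():
        V_next = V_phi(states).squeeze(-1)               # V_phi(s') for all s'
        # Step 1: y_i <- max_a [ r(s,a) + gamma * E_{s'}[V_phi(s')] ]
        y = (R + gamma * P @ V_next).max(dim=1).values   # shape (S,)
    # Step 2: phi <- argmin_phi 1/2 sum_i || V_phi(s_i) - y_i ||^2
    for _ in range(50):                                  # inner regression steps
        loss = 0.5 * ((V_phi(states).squeeze(-1) - y) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
```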

Unfortunately, when we introduce a neural network, our convergence guarantee no longer holds. The network effectively restricts the space of representable value functions, and if we contract toward $V^*$ and then fit our neural network to the desired update, it's possible to move further away from the optimal $V^*$.

Fitted Q-Iteration

Value iteration and fitted value iteration both require knowing the transition probability $p(s' \mid s, a)$. We can circumvent this by fitting the $Q$ function instead, as follows:

  1. Set $y_i \leftarrow r(s_i, a_i) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s_i, a_i)}[V_\phi(s')]$
  1. Set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \lVert Q_\phi(s_i, a_i) - y_i \rVert^2$

Note that in the first step, we approximate the expectation

$$\mathbb{E}_{s' \sim p(s' \mid s_i, a_i)}[V_\phi(s')] \approx V_\phi(s_i')$$

using a single sample $s_i'$, and our value function is defined by the best action, $V_\phi(s_i') = \max_{a'} Q_\phi(s_i', a')$, like above. Fitted Q-iteration now only requires tuples $(s_i, a_i, r_i, s_i')$, which can be generated by any policy, and thus it relaxes our reliance on $p(s' \mid s, a)$. We can collect these tuples in a dataset before training or perform online updates; the latter gives us the function approximation version of 🚀 Q-Learning.
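As a rough sketch of the batch version, assuming PyTorch, a hypothetical dataset of $(s_i, a_i, r_i, s_i')$ tuples, continuous states, and discrete actions (all names illustrative):

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99

# Q_phi maps a state to one Q-value per discrete action.
Q_phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(Q_phi.parameters(), lr=1e-3)

# Hypothetical dataset of (s, a, r, s') tuples from any behavior policy.
N = 1024
s  = torch.randn(N, state_dim)
a  = torch.randint(n_actions, (N,))
r  = torch.randn(N)
s2 = torch.randn(N, state_dim)

for it in range(100):
    with torch.no_grad():
        # Step 1: y_i <- r_i + gamma * max_{a'} Q_phi(s'_i, a')
        # (the expectation over s' is approximated by the single sample s'_i)
        y = r + gamma * Q_phi(s2).max(dim=1).values
    # Step 2: phi <- argmin_phi 1/2 sum_i || Q_phi(s_i, a_i) - y_i ||^2
    q_sa = Q_phi(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = 0.5 * ((q_sa - y) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Running the same two steps on one freshly collected transition at a time, instead of a fixed batch, corresponds to the online case mentioned above.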