Value iteration is a reinforcement learning algorithm that estimates the Q-function and value function using the environment's dynamics. The algorithm performs the following two steps for every state $s$ and action $a$:

- Set $Q(s, a) \leftarrow r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}[V(s')]$
- Set $V(s) \leftarrow \max_a Q(s, a)$
An alternative formulation merges the two steps, repeating a single update instead:

$$V(s) \leftarrow \max_a \left( r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s, a)}[V(s')] \right)$$
This method implicitly finds a deterministic optimal policy, since the second step assumes that our policy takes the best action, following $\pi(s) = \arg\max_a Q(s, a)$.
We can interpret value iteration as a simpler version of ♻️ Policy Iteration where we merge the policy improvement step into policy evaluation. It also follows the same convergence guarantee for the tabular case; the intuition is that each update contracts the distance between our current value function $V$ and the optimal value function $V^\star$.
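As a concrete illustration, here is a minimal NumPy sketch of the tabular algorithm. The MDP encoding (a transition tensor `P` and a reward matrix `R`) and the stopping threshold are illustrative assumptions, not part of the notes above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """Tabular value iteration on a toy MDP.

    P: transition probabilities, shape (S, A, S), P[s, a, s2] = p(s2 | s, a)
    R: rewards, shape (S, A), R[s, a] = r(s, a)
    """
    num_states, num_actions, _ = P.shape
    V = np.zeros(num_states)
    while True:
        # Step 1: Q(s, a) <- r(s, a) + gamma * E_{s'}[V(s')]
        Q = R + gamma * (P @ V)  # shape (S, A)
        # Step 2: V(s) <- max_a Q(s, a)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # The implicit deterministic policy: pi(s) = argmax_a Q(s, a)
    return V_new, Q.argmax(axis=1)
```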
Fitted Value Iteration
While the above assumes the tabular case, where $V(s)$ can be stored exactly for every discrete state, most interesting problems have state spaces far too large to enumerate. Fitted value iteration instead represents the value function with a neural network $V_\phi$ and repeats the following two steps over sampled states $s_i$ (see the sketch after this list):

- Set $y_i \leftarrow \max_{a_i} \left( r(s_i, a_i) + \gamma \, \mathbb{E}[V_\phi(s_i')] \right)$
- Set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \left\| V_\phi(s_i) - y_i \right\|^2$
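Here is a minimal PyTorch sketch of one round of these two steps. The network architecture, the 4-dimensional state, the learning rate, and the use of a single sampled next state per action in place of the expectation are all illustrative assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical value network V_phi for a 4-dimensional continuous state.
value_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
gamma = 0.99

def fitted_value_iteration_step(states, rewards, next_states):
    """One round of fitted value iteration.

    states:      shape (N, 4)    -- sampled states s_i
    rewards:     shape (N, A)    -- r(s_i, a) for each candidate action a
    next_states: shape (N, A, 4) -- one sampled s' per (s_i, a), standing in
                                    for the expectation E[V_phi(s')]
    """
    # Step 1: y_i = max_a ( r(s_i, a) + gamma * V_phi(s') )
    with torch.no_grad():
        next_values = value_net(next_states).squeeze(-1)  # (N, A)
        y = (rewards + gamma * next_values).max(dim=1).values
    # Step 2: regress V_phi(s_i) onto the targets y_i
    loss = 0.5 * ((value_net(states).squeeze(-1) - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```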
Unfortunately, when we introduce a neural network, our convergence guarantee no longer holds. Our neural network effectively restricts the space of possible value functions, and projecting each Bellman backup onto this restricted class is no longer a contraction, so repeated updates may oscillate or diverge.
Fitted Q-Iteration
Value iteration and fitted value iteration both require knowing the transition probabilities $p(s' \mid s, a)$ to compute the expectation over next states. Fitted Q-iteration drops this requirement: it learns the Q-function directly from sampled transitions $(s_i, a_i, s_i', r_i)$, repeating the following two steps (sketched in code below):
- Set $y_i \leftarrow r(s_i, a_i) + \gamma \, \mathbb{E}[V_\phi(s_i')]$, where $V_\phi(s_i') = \max_{a'} Q_\phi(s_i', a')$
- Set $\phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \left\| Q_\phi(s_i, a_i) - y_i \right\|^2$
Note that in the first step, we approximate the expectation $\mathbb{E}[V_\phi(s_i')]$ using a single sample $s_i'$ from the environment.
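A minimal PyTorch sketch of one round follows. The 4-dimensional state, the two discrete actions, and the network/optimizer settings are illustrative assumptions; the `max` over the Q-network's outputs is the single-sample target described above:

```python
import torch
import torch.nn as nn

NUM_ACTIONS = 2  # assumed discrete action space

# Hypothetical Q-network Q_phi that outputs one value per action.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def fitted_q_iteration_step(states, actions, rewards, next_states):
    """One round of fitted Q-iteration on sampled transitions (s, a, r, s').

    states, next_states: shape (N, 4); actions: LongTensor (N,); rewards: (N,)
    """
    # Step 1: y_i = r_i + gamma * max_a' Q_phi(s_i', a'),
    # a single-sample estimate of E[V_phi(s_i')]
    with torch.no_grad():
        y = rewards + gamma * q_net(next_states).max(dim=1).values
    # Step 2: regress Q_phi(s_i, a_i) onto the targets y_i
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = 0.5 * ((q_sa - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```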