Policy evaluation computes the value function $V^\pi$ for a certain policy $\pi$. This is either done directly if we're given the environment dynamics or approximately with some estimation method.

Iterative Evaluation

In the dynamic programming case, we're given the environment dynamics and can solve the 🔔 Bellman Equation directly. However, solving this system of equations exactly is usually too computationally expensive, so we can instead perform an iterative approach,

$$V^\pi(s) \leftarrow \mathbb{E}_{a \sim \pi(a \mid s)} \left[ r(s, a) + \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ V^\pi(s') \right] \right],$$

repeated until $V^\pi$ converges.
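
As a minimal sketch of this iterative approach, assuming a small tabular MDP where the dynamics `P[s, a, s']`, expected rewards `R[s, a]`, and policy probabilities `pi[s, a]` are given as arrays (all names here are illustrative):

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.99, tol=1e-8):
    """Repeatedly apply the Bellman backup for a fixed policy until convergence.

    P[s, a, s'] : transition probabilities (assumed known, tabular)
    R[s, a]     : expected reward for taking action a in state s
    pi[s, a]    : probability the policy assigns to action a in state s
    gamma       : discount factor (discussed at the end of this section);
                  gamma = 1 recovers the undiscounted update shown above
    """
    V = np.zeros(P.shape[0])
    while True:
        # Expected next-state value for every (s, a) pair: sum_s' P(s'|s,a) V(s')
        next_values = np.einsum("saz,z->sa", P, V)
        # Bellman backup: V(s) <- sum_a pi(a|s) [ R(s,a) + gamma * E[V(s')] ]
        V_new = np.einsum("sa,sa->s", pi, R + gamma * next_values)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```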

Monte Carlo Estimate

Without the environment dynamics, one approach is to run our policy and collect trajectories with rewards. Then, we approximate $V^\pi$ using a function approximator $\hat{V}^\pi_\phi(s)$.

Fitting this approximator with weights $\phi$ is a supervised learning problem with training data consisting of state-reward pairs

$$\left\{ \left( s_{i, t},\ y_{i, t} \right) \right\}, \quad y_{i, t} = \sum_{t' = t}^{T} r(s_{i, t'}, a_{i, t'}),$$

and we train on the objective

$$\mathcal{L}(\phi) = \frac{1}{2} \sum_{i, t} \left\lVert \hat{V}^\pi_\phi(s_{i, t}) - y_{i, t} \right\rVert^2.$$
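
As a rough sketch of this supervised fit, assuming trajectories arrive as `(states, rewards)` arrays and using a simple linear approximator $\hat{V}^\pi_\phi(s) = \phi^\top s$ (both the data format and the linear model are illustrative choices, not prescribed here):

```python
import numpy as np

def reward_to_go(rewards):
    """Monte Carlo targets y_t = sum_{t' >= t} r_t' for one trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    return np.cumsum(rewards[::-1])[::-1]

def fit_value_function(trajectories, n_steps=500, lr=1e-2):
    """Fit a linear value approximator V_phi(s) = phi^T s by gradient descent
    on the squared error against Monte Carlo reward-to-go targets.

    `trajectories` is a list of (states, rewards) pairs, where `states` is a
    (T, d) array and `rewards` a length-T array (illustrative format).
    """
    states = np.concatenate([s for s, _ in trajectories])
    targets = np.concatenate([reward_to_go(r) for _, r in trajectories])
    phi = np.zeros(states.shape[1])
    for _ in range(n_steps):
        residual = states @ phi - targets           # V_phi(s_i) - y_i
        grad = states.T @ residual / len(targets)   # gradient of the mean squared error
        phi -= lr * grad
    return phi
```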

Bootstrap Estimate

We can improve our above estimates for $V^\pi$ by observing that

$$V^\pi(s_t) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t' = t}^{T} r(s_{t'}, a_{t'}) \right] \approx r(s_t, a_t) + V^\pi(s_{t+1}).$$

Instead of taking a single Monte Carlo estimate for the entire trajectory after time $t$, we now have a single-sample estimate for just the next reward and use the expectation $V^\pi(s_{t+1})$ for the trajectory after $t + 1$.

However, we don't actually know the true $V^\pi$, so we replace it with our function approximator $\hat{V}^\pi_\phi$. Then, our training data is

$$\left\{ \left( s_{i, t},\ y_{i, t} \right) \right\}, \quad y_{i, t} = r(s_{i, t}, a_{i, t}) + \hat{V}^\pi_\phi(s_{i, t+1}),$$

and we use the same objective as above. This reduces the variance of our approximation, but it introduces more bias since we're using our approximation of $V^\pi$.
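
Continuing the illustrative linear setup from the sketch above, the bootstrapped targets could be computed like this, recomputed with the current weights $\phi$ and then plugged into the same squared-error objective (a sketch, not a prescribed implementation):

```python
import numpy as np

def bootstrap_targets(rewards, next_states, phi):
    """Bootstrapped targets y_t = r(s_t, a_t) + V_phi(s_{t+1}).

    `rewards` is a length-N array of observed one-step rewards, `next_states`
    an (N, d) array of successor states, and `phi` the current value weights.
    """
    return np.asarray(rewards, dtype=float) + next_states @ phi
```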

Finally, there's one small modification to the estimate in case we have an infinite-horizon problem. In the current setup, the value function will keep getting larger and larger due to the infinite horizon. To address this, we introduce a discount factor $\gamma \in (0, 1)$ that slightly reduces the value of the next time step, giving us

$$y_{i, t} = r(s_{i, t}, a_{i, t}) + \gamma \hat{V}^\pi_\phi(s_{i, t+1}).$$

This is analogous to putting a "time pressure" on our values, prioritizing sooner rewards over equivalent later rewards.
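
For intuition, assuming rewards are bounded by some $r_{\max}$ (an assumption added here for illustration), the discount keeps the infinite-horizon value finite:

$$\left| \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right| \le \sum_{t=0}^{\infty} \gamma^t r_{\max} = \frac{r_{\max}}{1 - \gamma}.$$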