Policy evaluation computes the value function $V^\pi(s)$: the expected total reward obtained by starting in state $s$ and then following the policy $\pi$,
$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t \ge 0} r(s_t, a_t) \,\middle|\, s_0 = s\right].$$
Iterative Evaluation
In the dynamic programming case, we're given the environment dynamics $p(s' \mid s, a)$, so we can compute $V^\pi$ by repeatedly applying the Bellman backup until it converges:
$$V^\pi(s) \leftarrow \sum_a \pi(a \mid s)\left[r(s, a) + \sum_{s'} p(s' \mid s, a)\, V^\pi(s')\right].$$
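As a rough sketch of this backup in code (the tabular arrays `P`, `R`, `pi` and their shapes are assumptions for illustration, not something given above):

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=1.0, num_iters=1000, tol=1e-8):
    """Tabular policy evaluation by repeated Bellman backups.

    P:  (S, A, S) array of transition probabilities p(s' | s, a)
    R:  (S, A) array of expected rewards r(s, a)
    pi: (S, A) array of action probabilities pi(a | s)
    gamma=1.0 matches the undiscounted setup here; a discount < 1
    (introduced later in this section) keeps infinite-horizon values bounded.
    """
    V = np.zeros(P.shape[0])
    for _ in range(num_iters):
        # Expected next-state value under the dynamics, then average over actions.
        expected_next = np.einsum("sap,p->sa", P, V)
        V_new = np.einsum("sa,sa->s", pi, R + gamma * expected_next)
        if np.max(np.abs(V_new - V)) < tol:  # stop once the backup has converged
            return V_new
        V = V_new
    return V
```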
Monte Carlo Estimate
Without the environment dynamics, one approach is to run our policy and collect trajectories with rewards. Then, we approximate the value function with a function approximator (e.g. a neural network) $\hat{V}^\pi_\phi(s)$.
Fitting this approximator with weights $\phi$, we use as the target for each visited state the return actually observed after it,
$$y_t = \sum_{t'=t}^{T} r(s_{t'}, a_{t'}),$$
and we train on the objective
$$\mathcal{L}(\phi) = \frac{1}{2} \sum_i \left\lVert \hat{V}^\pi_\phi(s_i) - y_i \right\rVert^2.$$
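A minimal sketch of this regression step, assuming 4-dimensional states and a small PyTorch MLP (both hypothetical choices, not from the text); the targets are the undiscounted reward-to-go from each visited state:

```python
import torch
import torch.nn as nn

# Hypothetical value network: states (assumed 4-dimensional) -> scalar value.
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def reward_to_go(rewards):
    """Monte Carlo targets: sum of rewards observed from each timestep onward."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running += r
        returns.append(running)
    return list(reversed(returns))

def fit_value_function(states, targets, num_steps=100):
    """Regress V_phi(s) onto the targets with the 1/2 * L2 objective above."""
    states = torch.as_tensor(states, dtype=torch.float32)
    targets = torch.as_tensor(targets, dtype=torch.float32).unsqueeze(-1)
    for _ in range(num_steps):
        loss = 0.5 * ((value_net(states) - targets) ** 2).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```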
Bootstrap Estimate
We can improve our above estimates for the targets $y_t$. Instead of taking a single Monte Carlo estimate for the trajectory after time $t$, we can bootstrap: use the observed reward at time $t$ plus the value of the state we land in,
$$y_t = r(s_t, a_t) + V^\pi(s_{t+1}).$$
However, we don't actually know the true $V^\pi(s_{t+1})$, so we substitute our current approximation, giving
$$y_t = r(s_t, a_t) + \hat{V}^\pi_\phi(s_{t+1}),$$
and we use the same objective as above. This reduces the variance of our approximation, but it introduces more bias since we're using our (possibly incorrect) approximation of $V^\pi$ inside the target.
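Continuing the sketch above (same hypothetical `value_net` and `fit_value_function`), the only change is how the targets are formed:

```python
def bootstrap_targets(rewards, next_states):
    """Targets y_t = r(s_t, a_t) + V_phi(s_{t+1}), bootstrapping from the current network."""
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    with torch.no_grad():  # the bootstrapped term is treated as a fixed label
        next_values = value_net(next_states).squeeze(-1)
    # Note: a real implementation would also zero out the bootstrap term at terminal states.
    return torch.as_tensor(rewards, dtype=torch.float32) + next_values

# Fitting is unchanged; we simply regress onto the bootstrapped targets instead:
# fit_value_function(states, bootstrap_targets(rewards, next_states))
```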
Finally, there's one small modification to the estimate in case we have an infinite-horizon problem. In the current setup, the value function will keep getting larger and larger due to the infinite horizon. To address this, we introduce a discount factor $\gamma \in (0, 1)$, so the bootstrapped target becomes
$$y_t = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}),$$
and the Monte Carlo return becomes $\sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'})$, which stays bounded as long as the rewards are bounded.
This is analogous to putting a "time pressure" on our values, prioritizing sooner rewards over equivalent later rewards.
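As a sketch, the Monte Carlo reward-to-go computation above changes only by folding in $\gamma$ at each step (gamma = 0.99 is just a common choice, not specified in the text):

```python
def discounted_reward_to_go(rewards, gamma=0.99):
    """Discounted targets: sum over t' >= t of gamma^(t' - t) * r_{t'}."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running  # each step shrinks later rewards' weight by gamma
        returns.append(running)
    return list(reversed(returns))
```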