Rollout algorithms are decision-time planning methods that decide the next action by estimating action values via 🪙 Monte Carlo Control. That is, using the environment dynamics, we simulate many trajectories that start at our current state and run to a terminal state; we then estimate each action's value as the average return of the simulated trajectories that begin with that action.
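To make this concrete, below is a minimal Python sketch of the estimation step. The names here are assumptions rather than anything from a specific library: `model.step(state, action)` stands in for the environment dynamics and returns `(next_state, reward, done)`, and `rollout_policy` is any fixed policy used to complete the trajectories.

```python
def estimate_action_value(model, state, action, rollout_policy,
                          num_trajectories=100, gamma=1.0, max_depth=200):
    """Monte Carlo estimate of q(state, action): average return over
    trajectories that take `action` first, then follow the rollout policy."""
    total_return = 0.0
    for _ in range(num_trajectories):
        s, a, ret, discount = state, action, 0.0, 1.0
        for _ in range(max_depth):
            s, reward, done = model.step(s, a)  # hypothetical model interface
            ret += discount * reward
            discount *= gamma
            if done:                 # trajectory reached a terminal state
                break
            a = rollout_policy(s)    # complete the trajectory with the rollout policy
        total_return += ret
    return total_return / num_trajectories
```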

Rather than optimizing a universal policy over all states, we optimize a rollout policy that decides which trajectories we sample in MC control. We then improve by choosing the greedy action,

$$a^* = \operatorname*{argmax}_a \hat{q}(s, a),$$

and modifying the rollout policy to repeat this decision in the future. This is similar to a single step of ♻️ Policy Iteration, except that we only update the policy for this single state $s$. However, unlike dynamic programming (a background planning method), our action-value estimates are discarded after the action is chosen.
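Building on the sketch above, the decision step might look as follows; `rollout_action` and the finite `actions` collection are likewise assumed names. The estimates are recomputed from scratch at every state and thrown away afterwards, matching the decision-time character of rollout.

```python
def rollout_action(model, state, actions, rollout_policy, **kwargs):
    """Decision-time planning: estimate each action's value at this state,
    act greedily, and discard the estimates afterwards."""
    q = {a: estimate_action_value(model, state, a, rollout_policy, **kwargs)
         for a in actions}
    return max(q, key=q.get)  # greedy action: argmax_a q_hat(s, a)
```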