Real-time dynamic programming (RTDP) is an asynchronous variant of 💎 Value Iteration that selects states via trajectory sampling rather than systematic sweeps over the entire state space. Specifically, it follows the greedy policy defined by the current value function (possibly with exploration, for example via 💰 Epsilon-Greedy), and at each state visited along the trajectory it performs the value iteration update step.
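
Writing $V$ for the current value estimate, $p(s' \mid s, a)$ for the transition probabilities, $r(s, a, s')$ for the reward, and $\gamma$ for the discount factor, the update applied to the visited state $s$ is

$$
V(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s, a, s') + \gamma\, V(s') \,\bigr].
$$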

Convergence

Convergence to a policy that is optimal on the relevant states is guaranteed for stochastic shortest-path problems satisfying four conditions: the initial value of every goal state is zero; at least one policy reaches a goal state with probability one from every start state; all rewards on transitions from non-goal states are strictly negative; and no initial value is below its optimal value (zero initialization satisfies this). The advantage over standard dynamic programming methods is that to reach this convergence, RTDP only visits states relevant to the problem, namely those reachable from the start states under an optimal policy, rather than sweeping over the entire state space.
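
As a concrete illustration, here is a minimal tabular RTDP sketch on a toy stochastic shortest-path corridor; the corridor length, slip probability, step cost, and epsilon value are illustrative assumptions, not part of the original description.

```python
import random

# Toy stochastic shortest-path problem: a corridor of states 0..N-1
# with the goal at state N-1. Every move from a non-goal state costs
# -1 (strictly negative rewards), and actions "slip" to the opposite
# direction with small probability. All constants are illustrative.
N = 10
GOAL = N - 1
ACTIONS = (-1, +1)   # step left / step right
SLIP = 0.1           # chance the action moves the wrong way
GAMMA = 1.0          # undiscounted shortest-path setting

def transitions(s, a):
    """Return [(prob, next_state, reward), ...] for taking a in s."""
    intended = min(max(s + a, 0), N - 1)
    slipped = min(max(s - a, 0), N - 1)
    return [(1.0 - SLIP, intended, -1.0), (SLIP, slipped, -1.0)]

def q_value(V, s, a):
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions(s, a))

def rtdp(episodes=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    V = [0.0] * N    # optimistic initialization; the goal value stays 0
    for _ in range(episodes):
        s = 0        # start state
        while s != GOAL:
            # Value iteration update at the visited state only.
            V[s] = max(q_value(V, s, a) for a in ACTIONS)
            # Follow the (epsilon-)greedy policy for the current V.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q_value(V, s, act))
            # Sample the next state from the transition model.
            roll = rng.random()
            for p, s2, _ in transitions(s, a):
                roll -= p
                if roll <= 0:
                    s = s2
                    break
    return V

V = rtdp()
print([round(v, 2) for v in V])  # values are negative, growing with distance to goal
```

Because every trajectory starts at state 0 and ends at the goal, the updates concentrate on states that actually lie on paths to the goal, which is exactly the relevant-states behavior described above.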