Real-time dynamic programming (RTDP) is an asynchronous variant of Value Iteration that selects states via trajectory sampling rather than a full sweep over all states. Specifically, it follows the greedy partial policy defined by the current value function (optionally with exploration, for example Epsilon-Greedy), and at each visited state it performs the value iteration update step

$$V(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V(s') \right]$$
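As a concrete illustration, here is a minimal tabular RTDP sketch in Python. The MDP interface (`states`, `actions`, `transition`, `reward`, `is_goal`) and all parameter names are hypothetical placeholders rather than any particular library's API; `transition(s, a)` is assumed to return a dictionary mapping successor states to probabilities.

```python
import random

def rtdp(states, actions, transition, reward, gamma, start, is_goal,
         episodes=1000, epsilon=0.1, max_steps=100):
    """Tabular RTDP sketch: sample greedy trajectories from the start state,
    applying the value iteration backup only at the states actually visited."""
    V = {s: 0.0 for s in states}  # zero init is admissible when rewards <= 0

    def backup(s):
        # One-step Bellman optimality backup at state s.
        V[s] = max(
            sum(p * (reward(s, a, s2) + gamma * V[s2])
                for s2, p in transition(s, a).items())
            for a in actions(s)
        )

    def greedy_action(s):
        # Action maximizing the one-step lookahead under the current V.
        return max(actions(s), key=lambda a: sum(
            p * (reward(s, a, s2) + gamma * V[s2])
            for s2, p in transition(s, a).items()))

    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            if is_goal(s):
                break            # goal states are absorbing with value 0
            backup(s)            # update the current state before acting
            # Epsilon-greedy trajectory sampling for exploration.
            if random.random() < epsilon:
                a = random.choice(list(actions(s)))
            else:
                a = greedy_action(s)
            # Sample the next state from the transition distribution.
            succ = transition(s, a)
            s = random.choices(list(succ), weights=succ.values())[0]
    return V
```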
Convergence
Convergence is guaranteed in problems where a goal state is reachable from every state and every transition out of a non-goal state has strictly negative reward. The advantage over standard dynamic programming methods is that, to reach convergence, RTDP only visits states that are relevant to the problem (those actually reachable from the start state under the policies it follows) rather than sweeping over the entire state space.
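To make these conditions concrete, the following hypothetical example reuses the `rtdp` sketch above on a small "corridor" problem in which the goal is reachable from every state and every non-goal transition yields a reward of -1, matching the setting described here.

```python
# Hypothetical 1-D corridor: states 0..5, goal at 5; moves can "slip"
# (agent stays put with probability 0.2), and every non-goal step costs -1.
STATES = range(6)
GOAL = 5

def actions(s):
    return ["left", "right"]

def transition(s, a):
    step = -1 if a == "left" else 1
    intended = min(max(s + step, 0), GOAL)
    # 80% move as intended, 20% stay put (slip).
    return {intended: 0.8, s: 0.2} if intended != s else {s: 1.0}

def reward(s, a, s2):
    return -1.0  # uniform step cost keeps all non-goal rewards negative

V = rtdp(list(STATES), actions, transition, reward, gamma=1.0,
         start=0, is_goal=lambda s: s == GOAL)
print({s: round(v, 2) for s, v in sorted(V.items())})
```

Because trajectories start at state 0 and the greedy policy quickly steers toward the goal, states the policy never reaches would keep their initial values; here the corridor is small enough that all states get backed up.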