Sarsa is an on-policy Temporal Difference learning method. The core update rule is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

applied to every state-action to state-action sample transition $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ generated by following our policy.
Since Sarsa learns on-policy, the policy it follows must keep visiting all state-action pairs in order to estimate the Q-function. Thus, we generally use an Epsilon-Greedy policy to choose our next action.
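The update and the epsilon-greedy action selection can be sketched as follows; this is a minimal illustration with a tabular Q array, and the function names are my own, not from the text:

```python
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon, rng):
    """With probability epsilon explore a random action, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One Sarsa step: move Q(s,a) toward the TD target r + gamma * Q(s',a')."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

Note that the bootstrap term uses $A_{t+1}$, the action the agent will actually take, which is what makes Sarsa on-policy.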
Expected Sarsa
Expected Sarsa eliminates the single-sample estimate for $Q(S_{t+1}, A_{t+1})$ by replacing it with its expectation under the policy:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

This removes the variance introduced by sampling $A_{t+1}$, but it also increases computational cost, since we now need to consider all possible actions in $S_{t+1}$.
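A sketch of one Expected Sarsa update under an epsilon-greedy policy; the expectation over next actions is computed from the epsilon-greedy action probabilities (the function name is illustrative):

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon):
    """Expected Sarsa: bootstrap with the expectation of Q(s', .)
    under the epsilon-greedy policy instead of a sampled Q(s', a')."""
    n_actions = Q.shape[1]
    # epsilon-greedy probabilities: uniform epsilon mass plus greedy bonus
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    expected_q = probs @ Q[s_next]
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
```

The extra cost is visible here: the update touches every entry of `Q[s_next]`, not just the one sampled action.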
Semi-Gradient Sarsa
If we approximate the action-value function with a parameterized function $\hat{q}(s, a, \mathbf{w})$ (e.g. a linear model or a neural network), the tabular update becomes a gradient step on the weights:

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R_{t+1} + \gamma\, \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \right] \nabla_{\mathbf{w}}\, \hat{q}(S_t, A_t, \mathbf{w})$$

It is called *semi*-gradient because the TD target also depends on $\mathbf{w}$, but we do not differentiate through it.
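A minimal sketch of one semi-gradient Sarsa step, assuming a linear approximator $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \phi(s, a)$ so that the gradient with respect to $\mathbf{w}$ is simply the feature vector $\phi(s, a)$ (the feature representation and function name are assumptions for illustration):

```python
import numpy as np

def semi_gradient_sarsa_update(w, phi, phi_next, r, alpha, gamma, terminal=False):
    """One semi-gradient Sarsa step for a linear q-hat(s,a,w) = w . phi(s,a).

    The TD target r + gamma * q(s',a') is treated as a constant:
    we only differentiate q(s,a,w), whose gradient is phi(s,a)."""
    q = w @ phi
    q_next = 0.0 if terminal else w @ phi_next
    td_error = r + gamma * q_next - q
    w += alpha * td_error * phi  # in-place weight update
    return w
```

With a nonlinear approximator, `phi` would be replaced by the gradient of the network's output for the chosen state-action pair.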