Sarsa is an on-policy ⌛️ Temporal Difference Learning method. The core update rule is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

for all state-action to state-action sample transitions following our policy $\pi$. This update requires the tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, hence the name.

As in any on-policy method, our policy needs to keep visiting all state-action pairs in order to estimate the Q-function. Thus, we generally use 💰 Epsilon-Greedy to choose our next action.
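As a concrete sketch, the code below runs one episode of tabular Sarsa with an epsilon-greedy behavior policy. It assumes a Gymnasium-style environment with discrete states and actions and a Q-table stored as a NumPy array; the function and parameter names are illustrative, not taken from any particular library.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, seed=None):
    """Run one episode of tabular Sarsa, updating Q in place.

    Assumes a Gymnasium-style env with discrete states and actions,
    and Q as an (n_states, n_actions) NumPy array.
    """
    rng = np.random.default_rng(seed)
    state, _ = env.reset()
    action = epsilon_greedy(Q, state, epsilon, rng)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # A_{t+1} is drawn from the same epsilon-greedy policy we are improving.
        next_action = epsilon_greedy(Q, next_state, epsilon, rng)
        # TD target uses the sampled pair: R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}).
        target = reward + (0.0 if done else gamma * Q[next_state, next_action])
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action
    return Q
```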

Expected Sarsa

Expected Sarsa eliminates the single-sample estimate of $Q(S_{t+1}, A_{t+1})$ by replacing it with an expectation over the policy. Our update rule is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) \, Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

but this also increases computational cost, since we now need to consider all possible actions $a$ in $S_{t+1}$.
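A minimal sketch of the changed target, assuming the same tabular setup as above with an epsilon-greedy policy derived from Q: the expectation is just a dot product between the policy's action probabilities and the Q-values of the next state.

```python
import numpy as np

def expected_q(Q, state, epsilon):
    """Expectation of Q(state, .) under an epsilon-greedy policy derived from Q."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[state])] += 1.0 - epsilon
    return float(np.dot(probs, Q[state]))

# Inside the episode loop, the only change from Sarsa is the TD target:
# target = reward + (0.0 if done else gamma * expected_q(Q, next_state, epsilon))
```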

Semi-Gradient Sarsa

If we approximate $Q$ with a parameterized function $\hat{q}(s, a, \mathbf{w})$, our update is a semi-gradient step

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w})$$
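A minimal sketch of one such step, assuming a linear approximator $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$ so that the gradient with respect to $\mathbf{w}$ is simply the feature vector; all names here are illustrative.

```python
import numpy as np

def q_hat(w, features):
    """Linear action-value estimate: q_hat(s, a, w) = w . x(s, a)."""
    return float(np.dot(w, features))

def semi_gradient_sarsa_update(w, x_sa, reward, x_next_sa, done, alpha=0.01, gamma=0.99):
    """One semi-gradient Sarsa step for a linear approximator.

    x_sa and x_next_sa are the feature vectors x(S_t, A_t) and x(S_{t+1}, A_{t+1}).
    """
    target = reward + (0.0 if done else gamma * q_hat(w, x_next_sa))
    td_error = target - q_hat(w, x_sa)
    # Semi-gradient: the target is treated as a constant, so we only differentiate
    # q_hat(S_t, A_t, w), whose gradient for a linear model is x(S_t, A_t).
    w += alpha * td_error * x_sa
    return w
```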