Sarsa is an on-policy ⌛️ Temporal Difference Learning method. The core update rule is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

for all state-action to state-action sample transitions following our policy $\pi$. This update requires the tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, hence the name.

As in any on-policy method, our policy needs to keep visiting all state-action pairs in order to estimate the Q-function. Thus, we generally use 💰 Epsilon-Greedy to choose our next action.
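As a concrete sketch, the code below runs one episode of tabular Sarsa with an epsilon-greedy behavior policy. It assumes a Gymnasium-style environment with discrete states and actions and a Q-table stored as a NumPy array; the function and parameter names are illustrative, not taken from any particular library.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1, seed=None):
    """Run one episode of tabular Sarsa, updating Q in place.

    Assumes a Gymnasium-style env with discrete states and actions,
    and Q as an (n_states, n_actions) NumPy array.
    """
    rng = np.random.default_rng(seed)
    state, _ = env.reset()
    action = epsilon_greedy(Q, state, epsilon, rng)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # A_{t+1} is drawn from the same epsilon-greedy policy we are improving.
        next_action = epsilon_greedy(Q, next_state, epsilon, rng)
        # TD target uses the sampled pair: R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}).
        target = reward + (0.0 if done else gamma * Q[next_state, next_action])
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action
    return Q
```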

Expected Sarsa

Expected Sarsa eliminates the single-sample estimate of $Q(S_{t+1}, A_{t+1})$ by replacing it with an expectation over the policy. Our update rule is

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) \, Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$

but this also increases computational cost, since we now need to consider all possible actions $a$ in $S_{t+1}$.
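A minimal sketch of the changed target, assuming the same tabular setup as above with an epsilon-greedy policy derived from Q: the expectation is just a dot product between the policy's action probabilities and the Q-values of the next state.

```python
import numpy as np

def expected_q(Q, state, epsilon):
    """Expectation of Q(state, .) under an epsilon-greedy policy derived from Q."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[state])] += 1.0 - epsilon
    return float(np.dot(probs, Q[state]))

# Inside the episode loop, the only change from Sarsa is the TD target:
# target = reward + (0.0 if done else gamma * expected_q(Q, next_state, epsilon))
```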

Semi-Gradient Sarsa

If we approximate $Q$ with a parameterized function $\hat{q}(s, a, \mathbf{w})$, our update is a semi-gradient step

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}) \right] \nabla \hat{q}(S_t, A_t, \mathbf{w})$$
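A minimal sketch of one such step, assuming a linear approximator $\hat{q}(s, a, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s, a)$ so that the gradient with respect to $\mathbf{w}$ is simply the feature vector; all names here are illustrative.

```python
import numpy as np

def q_hat(w, features):
    """Linear action-value estimate: q_hat(s, a, w) = w . x(s, a)."""
    return float(np.dot(w, features))

def semi_gradient_sarsa_update(w, x_sa, reward, x_next_sa, done, alpha=0.01, gamma=0.99):
    """One semi-gradient Sarsa step for a linear approximator.

    x_sa and x_next_sa are the feature vectors x(S_t, A_t) and x(S_{t+1}, A_{t+1}).
    """
    target = reward + (0.0 if done else gamma * q_hat(w, x_next_sa))
    td_error = target - q_hat(w, x_sa)
    # Semi-gradient: the target is treated as a constant, so we only differentiate
    # q_hat(S_t, A_t, w), whose gradient for a linear model is x(S_t, A_t).
    w += alpha * td_error * x_sa
    return w
```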