Markov decision processes (MDPs) model the reinforcement learning environment, defining states $\mathcal{S}$, actions $\mathcal{A}$, and rewards $\mathcal{R}$. At each time step $t$, the agent takes an action $A_t$ based on the current state $S_t$, causing it to land in a new state $S_{t+1}$ and receive a reward $R_{t+1}$. Its trajectory thus looks like

$$
S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots
$$

In a finite MDP, the dynamics can be described by the distribution

$$
p(s', r \mid s, a) \doteq \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}
$$

The new state and reward depend only on the current state and action; the states thus satisfy the Markov property, much like in ⛓️ Markov Chains.
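
As a minimal sketch, here is what these dynamics might look like in tabular form. The two-state MDP, its action names, and the `step`/`rollout` helpers are all hypothetical, made up purely for illustration; the `dynamics` table plays the role of $p(s', r \mid s, a)$, and `rollout` produces a trajectory $S_0, A_0, R_1, S_1, \dots$

```python
import random

# A tiny hypothetical MDP (not from these notes): two states, two actions.
# dynamics[(s, a)] lists (next_state, reward, probability) triples,
# i.e. a tabular form of p(s', r | s, a).
dynamics = {
    ("s0", "stay"): [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 0.7), ("s1", 2.0, 0.3)],
}

def step(state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    outcomes = dynamics[(state, action)]
    weights = [p for (_, _, p) in outcomes]
    s_next, reward, _ = random.choices(outcomes, weights=weights, k=1)[0]
    return s_next, reward

def rollout(start, policy, steps=4):
    """Roll out the trajectory S0, A0, R1, S1, A1, R2, ..."""
    s = start
    trajectory = [s]
    for _ in range(steps):
        a = policy(s)
        s, r = step(s, a)
        trajectory += [a, r, s]
    return trajectory

# Example: a uniformly random policy over the two actions.
print(rollout("s0", policy=lambda s: random.choice(["stay", "go"])))
```
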

From these dynamics, we can get the state transitions

$$
p(s' \mid s, a) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)
$$

and the expected reward

$$
r(s, a) = \mathbb{E}[R_t \mid S_{t-1} = s, A_{t-1} = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)
$$
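
Continuing the hypothetical sketch above (reusing its made-up `dynamics` table), these two marginalizations can be computed by summing over the stored triples:

```python
from collections import defaultdict

def state_transition_probs(dynamics, s, a):
    """p(s' | s, a): sum p(s', r | s, a) over all rewards r."""
    probs = defaultdict(float)
    for s_next, _, p in dynamics[(s, a)]:
        probs[s_next] += p
    return dict(probs)

def expected_reward(dynamics, s, a):
    """r(s, a): sum r * p(s', r | s, a) over all (s', r) pairs."""
    return sum(r * p for _, r, p in dynamics[(s, a)])

print(state_transition_probs(dynamics, "s0", "go"))  # {'s1': 0.8, 's0': 0.2}
print(expected_reward(dynamics, "s0", "go"))         # 0.8
```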