Markov decision processes (MDPs) model the reinforcement learning environment, defining its states, actions, and rewards.
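In a finite MDP, the dynamics can be described by the distribution

$$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\},$$

written here in the common notation where $s'$ is the new state and $r$ the reward received after taking action $a$ in state $s$.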
The new state and reward depend only on the current state and action; the states thus follow the Markov property, much like in Markov Chains.
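From these dynamics, we can get the state transitions by summing the dynamics distribution $p(s', r \mid s, a)$ over rewards,

$$p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a),$$

and the expected reward for a state-action pair,

$$r(s, a) = \sum_{r} r \sum_{s'} p(s', r \mid s, a).$$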
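
As a rough illustration, both quantities can be computed from a tabular dynamics distribution. The sketch below is only one way to do it: the two-state MDP, its state and action names, its probabilities, and the helper names `state_transitions` and `expected_reward` are all made up for this example.

```python
from collections import defaultdict

# Hypothetical two-state MDP; every name and number here is illustrative.
# dynamics[(s, a)] maps (next_state, reward) -> probability p(s', r | s, a).
dynamics = {
    ("low", "search"): {("high", -3.0): 0.25, ("low", 1.0): 0.5, ("low", 0.0): 0.25},
    ("low", "wait"): {("low", 0.5): 1.0},
    ("high", "search"): {("high", 1.0): 0.5, ("low", 1.0): 0.5},
    ("high", "wait"): {("high", 0.5): 1.0},
}


def state_transitions(dynamics, s, a):
    """Return p(s' | s, a) by summing the joint distribution over rewards."""
    probs = defaultdict(float)
    for (s_next, _reward), p in dynamics[(s, a)].items():
        probs[s_next] += p
    return dict(probs)


def expected_reward(dynamics, s, a):
    """Return r(s, a): the sum of r * p(s', r | s, a) over all s' and r."""
    return sum(r * p for (_s_next, r), p in dynamics[(s, a)].items())


print(state_transitions(dynamics, "low", "search"))
print(expected_reward(dynamics, "low", "search"))
```

Storing the joint distribution per state-action pair keeps both derived quantities as simple sums over the same entries, which mirrors how they are obtained from the dynamics.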