Partially observable Markov Decision Processes (POMDPs) generalize the 🌎 Markov Decision Process to make a distinction between the environment’s states and the agent’s observations.

On top of the states $\mathcal{S}$, actions $\mathcal{A}$, and rewards $r(s_t, a_t)$ that define an MDP, we also have an observation space $\mathcal{O}$ and emissions $\mathcal{E}$. The emissions capture the probability distribution

$$p(o_t \mid s_t),$$

or the probability of getting a certain observation from the current state.
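To make the pieces concrete, here is a minimal sketch of a tabular POMDP as plain arrays; the specific numbers and names (`T`, `E`, `step`) are illustrative assumptions, not notation from any particular library.

```python
import numpy as np

# Hypothetical tabular POMDP: T[s, a, s'] = p(s' | s, a) are the transitions,
# E[s, o] = p(o | s) are the emissions, and r[s, a] are the rewards.
n_states, n_actions, n_obs = 2, 2, 2

T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.7, 0.3]]])
E = np.array([[0.85, 0.15],
              [0.30, 0.70]])
r = np.array([[1.0, 0.0],
              [0.0, 1.0]])

rng = np.random.default_rng(0)

def step(s, a):
    """Sample the next state, then the observation that state emits."""
    s_next = rng.choice(n_states, p=T[s, a])
    o = rng.choice(n_obs, p=E[s_next])  # the agent sees o, never s_next itself
    return s_next, o, r[s, a]

s = 0
s, o, reward = step(s, a=1)  # the agent's experience is (o, reward), not s
```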

Crucially, in a POMDP, the states satisfy the Markov property, whereas the observations might depend on the past. This influences the design of some reinforcement learning algorithms, which must avoid assuming the Markov property for observations.
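One way to see this: a single observation may be ambiguous about the state, but the belief over states, computed from the whole history of actions and observations, is Markov. The Bayes-filter sketch below (with made-up transition and emission tables) illustrates how that history gets folded in.

```python
import numpy as np

def belief_update(belief, a, o, T, E):
    """One Bayes-filter step: fold an (action, observation) pair into the belief.

    belief[s] approximates p(s_t = s | o_{1:t}, a_{1:t}); T[s, a, s'] and E[s, o]
    are the transition and emission tables.
    """
    predicted = belief @ T[:, a, :]   # push the belief through the dynamics
    updated = predicted * E[:, o]     # reweight by the emission likelihood
    return updated / updated.sum()

# Action 0 keeps the state, and emissions are only weakly informative,
# so no single observation identifies the state -- but the history does.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
E = np.array([[0.6, 0.4],
              [0.4, 0.6]])
b = np.array([0.5, 0.5])
for a, o in [(0, 0), (0, 0), (0, 0)]:  # repeated weak evidence accumulates
    b = belief_update(b, a, o, T, E)
print(b)  # roughly [0.77, 0.23]
```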

Learning

As with MDPs, model-based reinforcement learning algorithms that operate on observations (images, for example) seek to learn the world’s transitions. In POMDPs, we need to learn both $p(s_{t+1} \mid s_t, a_t)$ and $p(o_t \mid s_t)$. The former is called the dynamics model, and the latter is called the observation model.
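As an illustrative sketch (one common parameterization, not necessarily the one intended here), both models can be small networks that output diagonal Gaussians; the layer sizes and class names below are placeholders.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class DynamicsModel(nn.Module):
    """p_phi(s_{t+1} | s_t, a_t) as a diagonal Gaussian over the next latent state."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * state_dim),  # mean and log-std
        )

    def forward(self, s, a):
        mean, log_std = self.net(torch.cat([s, a], dim=-1)).chunk(2, dim=-1)
        return Normal(mean, log_std.exp())

class ObservationModel(nn.Module):
    """p_phi(o_t | s_t); for image observations this would be a deconvolutional decoder."""
    def __init__(self, state_dim, obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * obs_dim),
        )

    def forward(self, s):
        mean, log_std = self.net(s).chunk(2, dim=-1)
        return Normal(mean, log_std.exp())
```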

We can use a log-likelihood objective similar to the one for MDPs, but since the true states are unknown, we need to take an expectation over them,

$$\max_\phi \; \mathbb{E}_{s_{1:T} \sim p(s_{1:T} \mid o_{1:T}, a_{1:T})} \left[ \sum_{t} \log p_\phi(s_{t+1} \mid s_t, a_t) + \log p_\phi(o_t \mid s_t) \right].$$
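Given latent-state samples from some approximate posterior, this expectation can be estimated by Monte Carlo; a sketch of just that term (reusing the hypothetical models above) might look like the following.

```python
def model_objective(dynamics, observation, s, a, s_next, o):
    """Monte Carlo estimate of E_s[ log p_phi(s' | s, a) + log p_phi(o | s) ].

    s and s_next are sampled latent states (e.g. drawn from an approximate
    posterior); a and o are the logged actions and observations.
    """
    log_dyn = dynamics(s, a).log_prob(s_next).sum(dim=-1)
    log_obs = observation(s).log_prob(o).sum(dim=-1)
    return (log_dyn + log_obs).mean()  # maximized with respect to phi
```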

To compute this expectation, we need an approximate posterior encoder

$$q_\psi(s_t \mid o_{1:t}, a_{1:t}),$$

much like with 🖋️ Variational Autoencoders. There are many choices for this posterior, ranging from the fully expressive $q_\psi(s_t, s_{t+1} \mid o_{1:t}, a_{1:t})$ to the single-step $q_\psi(s_t \mid o_t)$.
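Below is a sketch of the simplest of these choices, a single-step encoder $q_\psi(s_t \mid o_t)$ amortized as in a VAE; the architecture is again a placeholder, and the reparameterized samples are what would be plugged into the expectation above.

```python
import torch.nn as nn
from torch.distributions import Normal

class SingleStepEncoder(nn.Module):
    """q_psi(s_t | o_t): the single-step approximate posterior over latent states."""
    def __init__(self, obs_dim, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * state_dim),
        )

    def forward(self, o):
        mean, log_std = self.net(o).chunk(2, dim=-1)
        return Normal(mean, log_std.exp())

# Reparameterized samples keep the objective differentiable with respect to psi:
#   s = SingleStepEncoder(obs_dim, state_dim)(o_batch).rsample()
# The fully expressive alternative conditions on the whole history
# (o_{1:t}, a_{1:t}), for example with a recurrent network.
```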