Partially observable Markov decision processes (POMDPs) generalize the Markov Decision Process to make a distinction between the environment's states and the agent's observations. On top of the states, we also have an observation space $\mathcal{O}$ and an emission model $p(o_t \mid s_t)$, or the probability of getting a certain observation from the current state. Crucially, in a POMDP, the states satisfy the Markov property, but the observations in general do not.
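Putting the pieces together (the tuple notation below is an assumed convention, not something fixed by these notes), a POMDP simply augments the usual MDP ingredients with the observation space and the emission model:

$$\mathcal{M} = \big(\mathcal{S}, \mathcal{A}, \mathcal{O},\; p(s_{t+1} \mid s_t, a_t),\; p(o_t \mid s_t),\; r\big)$$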
Learning
Like with MDPs, model-based reinforcement learning algorithms that work from observations (images, for example) seek to learn the world's transitions. In POMDPs, we need to learn both the latent transition dynamics $p(s_{t+1} \mid s_t, a_t)$ and the observation model $p(o_t \mid s_t)$.
We can use a log-likelihood objective similar to the one for MDPs, but since the true states are unknown, we need to take an expectation over them.
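One common way to write this objective (a sketch; the parameter symbol $\phi$ and the exact conditioning of the posterior are assumptions, not fixed by these notes) is to maximize the expected log-likelihood of the dynamics and observation models under a posterior over the latent states:

$$\max_{\phi}\; \sum_{t} \mathbb{E}_{(s_t, s_{t+1}) \sim q(s_t, s_{t+1} \mid o_{1:T}, a_{1:T})} \Big[ \log p_\phi(s_{t+1} \mid s_t, a_t) + \log p_\phi(o_t \mid s_t) \Big]$$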
To compute this expectation, we need an approximate posterior encoder $q_\psi$, much like with Variational Autoencoders. There are many choices for this posterior, ranging from the fully expressive smoothing posterior $q_\psi(s_t, s_{t+1} \mid o_{1:t}, a_{1:t})$, which conditions on the whole observation history, down to the simplest single-step encoder $q_\psi(s_t \mid o_t)$, which is easier to train but less accurate when a single observation does not fully reveal the state.
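As an illustrative sketch (not from these notes; the module names, Gaussian parameterizations, and hyperparameters are all assumptions), the following PyTorch snippet implements the simplest option, a single-step encoder $q_\psi(s_t \mid o_t)$ together with learned transition and observation models, trained with a single-sample estimate of the expected log-likelihood above:

```python
# Minimal sketch (assumed architecture): a latent state-space model for POMDPs with
# a single-step encoder q(s_t | o_t), transition model p(s_{t+1} | s_t, a_t),
# and observation model p(o_t | s_t), trained on the expected log-likelihood.
import torch
import torch.nn as nn
import torch.distributions as D


def gaussian_head(params):
    """Split a network output into the mean and log-std of a diagonal Gaussian."""
    mean, log_std = params.chunk(2, dim=-1)
    return D.Normal(mean, log_std.exp())


class LatentStateSpaceModel(nn.Module):
    def __init__(self, obs_dim, act_dim, state_dim=32, hidden=128):
        super().__init__()
        # q_psi(s_t | o_t): simplest single-step encoder
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * state_dim))
        # p_phi(s_{t+1} | s_t, a_t): latent transition dynamics
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * state_dim))
        # p_phi(o_t | s_t): observation (emission) model, unit-variance Gaussian decoder
        self.decoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, obs_dim))

    def loss(self, obs, act):
        """obs: (T, obs_dim), act: (T, act_dim) -- one trajectory."""
        q_s = gaussian_head(self.encoder(obs))   # q(s_t | o_t) for every timestep
        s = q_s.rsample()                        # one reparameterized sample of the states

        # E_q[ log p(o_t | s_t) ]: reconstruct observations from the sampled latents
        obs_ll = D.Normal(self.decoder(s), 1.0).log_prob(obs).sum()

        # E_q[ log p(s_{t+1} | s_t, a_t) ]: predict the next latent from state and action
        p_next = gaussian_head(self.dynamics(torch.cat([s[:-1], act[:-1]], dim=-1)))
        dyn_ll = p_next.log_prob(s[1:]).sum()

        return -(obs_ll + dyn_ll)  # negative expected log-likelihood (single-sample estimate)


if __name__ == "__main__":
    model = LatentStateSpaceModel(obs_dim=64, act_dim=4)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    obs, act = torch.randn(50, 64), torch.randn(50, 4)  # a fake 50-step trajectory
    opt.zero_grad()
    model.loss(obs, act).backward()
    opt.step()
```

Taking a single reparameterized sample from $q_\psi$ is the same approximation a VAE makes; a more expressive posterior would condition the encoder on the history $o_{1:t}, a_{1:t}$, for example with a recurrent network.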