Contextual bandits are a simplified reinforcement learning problem in which we have multiple Multi-Armed Bandits, each associated with its own state (the context). At each step the state is drawn at random, and the objective is to learn a policy mapping state to action that maximizes total expected reward.
This is an intermediate step between the multi-armed bandit, which only asks which action to take, and the full reinforcement learning problem, which involves a Markov Decision Process where states and actions are intertwined. The key difference between contextual bandits and MDPs is that our states are drawn at random, meaning our actions don't affect future states, whereas the full RL problem must account for how actions influence future states.
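To make this concrete, here is a minimal sketch of a contextual bandit loop using a simple epsilon-greedy agent. This is an illustrative assumption on my part, not a method from the text: the contexts, arm count, hidden Bernoulli reward probabilities (`true_probs`), and the epsilon value are all hypothetical, and the agent just keeps a running average reward estimate per (context, arm) pair.

```python
import numpy as np

# Hypothetical contextual bandit setup: contexts arrive at random,
# and each (context, arm) pair has its own hidden reward probability.
rng = np.random.default_rng(0)
n_contexts, n_arms = 3, 4
true_probs = rng.uniform(size=(n_contexts, n_arms))  # hidden Bernoulli reward probabilities

# Epsilon-greedy agent: running average reward estimate per (context, arm).
epsilon = 0.1
counts = np.zeros((n_contexts, n_arms))
values = np.zeros((n_contexts, n_arms))

for step in range(10_000):
    context = rng.integers(n_contexts)         # state is drawn at random each step
    if rng.random() < epsilon:
        arm = rng.integers(n_arms)             # explore: random action
    else:
        arm = int(np.argmax(values[context]))  # exploit: best-known arm for this context
    reward = float(rng.random() < true_probs[context, arm])

    # Incremental update of the average reward for this (context, arm) pair.
    counts[context, arm] += 1
    values[context, arm] += (reward - values[context, arm]) / counts[context, arm]

# The learned policy maps each context to its highest-value arm.
print("learned policy:", values.argmax(axis=1))
print("optimal policy:", true_probs.argmax(axis=1))
```

Note that the update never looks at what the next context will be: because contexts are sampled independently of our actions, each context can be treated as its own multi-armed bandit, which is exactly what separates this setting from a full MDP.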