The multi-armed, or $k$-armed, bandit is a simplification of the general Reinforcement Learning problem. We're in a single state, and we're given the option to choose between $k$ different actions, each yielding some stochastic reward; this is analogous to choosing different arms of a bandit, or slot machine, hence the name.
Our objective, as usual, is to maximize expected total reward. Let the true value of an action $a$ be

$$q_*(a) = \mathbb{E}[R_t \mid A_t = a],$$

where $R_t$ and $A_t$ are the reward and selected action at time step $t$. If we estimate $q_*(a)$ with $Q_t(a)$, then we can solve our problem by choosing the highest-valued action at each time step.
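As a concrete stand-in for the slot machine, here's a minimal sketch of such an environment in Python; the `KArmedBandit` class and its `pull` method are illustrative names assumed for the examples in this note, with each arm's true value drawn from a unit Gaussian.

```python
import random

# Hypothetical k-armed bandit environment used for the sketches in this note.
# Each arm a has an unknown true value q_*(a); pulling it returns a noisy reward.
class KArmedBandit:
    def __init__(self, k=10):
        self.true_values = [random.gauss(0.0, 1.0) for _ in range(k)]  # q_*(a)

    def pull(self, action):
        # Stochastic reward R_t centered on the arm's true value
        return random.gauss(self.true_values[action], 1.0)
```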
The problem is that to estimate $q_*$ accurately, we need to explore different actions. However, to achieve our objective, we need to exploit the best action we already know. This balance between exploration and exploitation is at the core of all reinforcement learning algorithms.
One simple estimate for $q_*(a)$ is the average of our past rewards for $a$,

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}.$$

If we haven't chosen $a$ in the past (meaning the denominator is $0$), we can set $Q_t(a)$ to some default value. At time step $t$, we can then choose a greedy action

$$A_t = \operatorname*{arg\,max}_a Q_t(a)$$

that exploits our current estimates. Alternatively, we can use Epsilon-Greedy, which chooses a random action with probability $\epsilon$ and exploits otherwise.
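A minimal sketch of this selection rule, assuming `Q` is a list of current action-value estimates indexed by arm:

```python
import random

# Epsilon-greedy action selection over current estimates Q (a list of floats,
# one entry per arm); epsilon is the exploration probability.
def select_action(Q, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(Q))            # explore: uniform random arm
    return max(range(len(Q)), key=lambda a: Q[a])  # exploit: greedy arm
```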
After choosing an action, we perform the iterative update

$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$

after observing the $n$th reward $R_n$ for this action. This is equivalent to recalculating the average using our first formula, but this form allows us to formulate the general update rule

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\,\bigl[\text{Target} - \text{OldEstimate}\bigr].$$
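A sketch of that incremental update, assuming `Q` and `N` are lists holding each arm's current estimate and selection count:

```python
# Incremental sample-average update: Q[a] and N[a] are the current estimate
# and selection count for arm a.
def update_estimate(Q, N, action, reward):
    N[action] += 1
    # NewEstimate <- OldEstimate + (1/n) * (Target - OldEstimate)
    Q[action] += (reward - Q[action]) / N[action]
```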
Sometimes our problem is non-stationary, meaning the reward distributions change over time. Thus, instead of a simple average, we want to place higher weight on more recent samples with the update rule

$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right]$$

for some step size $\alpha \in (0, 1]$. Expanding out this formula gives us

$$Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i,$$

where $Q_1$ is the default value and the summation loops over all past rewards. This gives us a weighted average that decays exponentially for older samples.
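The constant step-size version is a one-line change; `alpha` here is the assumed step-size parameter:

```python
# Constant step-size update for non-stationary problems: recent rewards are
# weighted exponentially more than old ones.
def update_estimate_nonstationary(Q, action, reward, alpha=0.1):
    Q[action] += alpha * (reward - Q[action])
```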
Alternatively, we can forgo any notion of $Q$ and instead learn a preference $H_t(a)$ for each action $a$ that has no interpretation in terms of reward. We'll pick actions according to the softmax defined by these preferences,

$$\Pr\{A_t = a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}} = \pi_t(a).$$

To update our preferences after selecting action $A_t$ and receiving reward $R_t$:

$$H_{t+1}(A_t) = H_t(A_t) + \alpha\left(R_t - \bar{R}_t\right)\left(1 - \pi_t(A_t)\right)$$
$$H_{t+1}(a) = H_t(a) - \alpha\left(R_t - \bar{R}_t\right)\pi_t(a) \quad \text{for all } a \neq A_t$$
$\bar{R}_t$ is a baseline measure that's usually set to the average of all past rewards (not just those for $A_t$). Intuitively, we increase the preference for $A_t$ if $R_t$ is better than average and decrease it if $R_t$ is worse; we also move the other actions' preferences in the opposite direction.
The above update rule comes directly from a derivation based on Gradient Descent; we're performing a gradient step in the direction of greater expected reward,

$$H_{t+1}(a) = H_t(a) + \alpha \frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}.$$
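Putting the preference-based approach together, here's a sketch of one step of the method, assuming `H` is a list of preferences and `baseline` is the running average reward maintained elsewhere:

```python
import math
import random

# Softmax policy over preferences H (one float per arm).
def softmax(H):
    m = max(H)                                   # subtract max for numerical stability
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(pi):
    # Draw an arm according to the softmax probabilities pi
    return random.choices(range(len(pi)), weights=pi)[0]

def gradient_bandit_step(H, action, reward, baseline, alpha=0.1):
    pi = softmax(H)
    for a in range(len(H)):
        if a == action:
            H[a] += alpha * (reward - baseline) * (1 - pi[a])  # push chosen arm up if above baseline
        else:
            H[a] -= alpha * (reward - baseline) * pi[a]        # push other arms the opposite way
```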