The multi-armed, or $k$-armed, bandit is a simplification of the general reinforcement learning problem. We're in a single state, and we're given the option to choose between $k$ different actions, each yielding some stochastic reward; this is analogous to choosing different arms of a bandit, or slot machine, hence the name.

Our objective, as usual, is to maximize expected total reward. Let the true value of an action $a$ be

$$q_*(a) = \mathbb{E}\left[R_t \mid A_t = a\right],$$

where $R_t$ and $A_t$ are the reward and selected action at time step $t$. If we estimate $q_*(a)$ with $Q_t(a)$, then we can solve our problem by choosing the highest-valued action at each time step.

The problem is that to estimate $q_*(a)$ accurately, we need to explore different actions. However, to achieve our objective, we need to exploit the best action we already know of. This balance between exploration and exploitation is at the core of all reinforcement learning algorithms.
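
To make the setup concrete, here is a minimal sketch of a $k$-armed bandit testbed with Gaussian reward noise; this environment (and every name in it, like `GaussianBandit` and `q_star`) is an illustrative assumption, not something defined above.

```python
import numpy as np

class GaussianBandit:
    """k arms; pulling arm a returns a reward drawn from N(q_star[a], 1)."""

    def __init__(self, k: int = 10, seed: int = 0):
        self.rng = np.random.default_rng(seed)
        # True action values q_*(a), unknown to the agent.
        self.q_star = self.rng.normal(0.0, 1.0, size=k)

    def pull(self, action: int) -> float:
        # Stochastic reward centered on the arm's true value.
        return float(self.rng.normal(self.q_star[action], 1.0))
```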

Action-Value Estimation

Sample-Average

One simple estimate for $q_*(a)$ is the average of our past rewards for $a$,

$$Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}.$$

If we haven't chosen $a$ in the past (meaning the denominator is $0$), we can set $Q_t(a)$ to some default value. At time step $t$, we can then choose a greedy action

$$A_t = \arg\max_a Q_t(a)$$

that exploits our current estimates. Alternatively, we can use 💰 Epsilon-Greedy, which chooses a random action with probability $\varepsilon$ and exploits otherwise.
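
As a rough sketch of epsilon-greedy selection (the names `q_estimates` and `epsilon` are my own, not from the original):

```python
import numpy as np

def select_action(q_estimates: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Explore: pick an arm uniformly at random.
        return int(rng.integers(len(q_estimates)))
    # Exploit: pick the arm with the highest current value estimate.
    return int(np.argmax(q_estimates))
```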

After choosing an action, we perform the iterative update

$$Q_{n+1} = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$$

after observing the $n$-th reward $R_n$ for action $a$. This is equivalent to recalculating the average using our first formula, but this form allows us to formulate the general update rule

$$\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize}\left[\text{Target} - \text{OldEstimate}\right]$$

that is widely used in other algorithms.
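
A minimal sketch of this incremental update, assuming we track a per-arm pull count alongside the estimates (the names `counts` and `update` are illustrative):

```python
import numpy as np

def update(q_estimates: np.ndarray, counts: np.ndarray, action: int, reward: float) -> None:
    """Incremental sample average: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)."""
    counts[action] += 1
    step_size = 1.0 / counts[action]
    q_estimates[action] += step_size * (reward - q_estimates[action])
```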

Exponential Recency-Weighted Average

Sometimes, in non-stationary problems, the reward distributions might change over time. Thus, instead of a simple average, we want to place higher weight on more recent samples with the update rule

$$Q_{n+1} = Q_n + \alpha\left[R_n - Q_n\right]$$

for some constant step size $\alpha \in (0, 1]$. Expanding out this formula gives us

$$Q_{n+1} = (1 - \alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1 - \alpha)^{n-i} R_i,$$

where $Q_1$ is the default value and the summation loops over all past rewards. This gives us a weighted average that decays exponentially for older samples.
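
For comparison with the sample-average case, a sketch of the constant-step-size update (again with illustrative names; `alpha` is the step size $\alpha$ above):

```python
import numpy as np

def update_nonstationary(q_estimates: np.ndarray, action: int, reward: float, alpha: float = 0.1) -> None:
    """Exponential recency-weighted average: Q_{n+1} = Q_n + alpha * (R_n - Q_n)."""
    q_estimates[action] += alpha * (reward - q_estimates[action])
```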

Gradient Bandit

Alternatively, we can forgo any notion of $Q_t(a)$ and instead learn a preference $H_t(a)$ for each action $a$ that has no interpretation in terms of reward. We'll pick actions according to the softmax defined by these preferences,

$$\Pr\{A_t = a\} = \pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}.$$

To update our preferences after selecting action $A_t$ and receiving reward $R_t$:

$$H_{t+1}(A_t) = H_t(A_t) + \alpha\left(R_t - \bar{R}_t\right)\left(1 - \pi_t(A_t)\right),$$
$$H_{t+1}(a) = H_t(a) - \alpha\left(R_t - \bar{R}_t\right)\pi_t(a) \qquad \text{for all } a \neq A_t.$$

$\bar{R}_t$ is a baseline measure that's usually set to the average of all past rewards (not just those for $A_t$). Intuitively, we increase our preference for $A_t$ if $R_t$ is better than the average and decrease it if $R_t$ is worse; we also move the other actions' preferences in the opposite direction.

The above update rule comes directly from a derivation from ⛰️ Gradient Descent; we're performing a gradient step in the direction of greater expected reward,

$$H_{t+1}(a) = H_t(a) + \alpha \frac{\partial\, \mathbb{E}[R_t]}{\partial H_t(a)}.$$
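
A sketch of the gradient-bandit update implementing the two preference equations above (all names here, such as `preferences` and `avg_reward`, are illustrative; the baseline $\bar{R}_t$ is assumed to be maintained elsewhere as a running average of all rewards):

```python
import numpy as np

def softmax(preferences: np.ndarray) -> np.ndarray:
    """Action probabilities pi_t(a) from preferences H_t(a)."""
    exp_h = np.exp(preferences - preferences.max())  # shift for numerical stability
    return exp_h / exp_h.sum()

def gradient_bandit_step(preferences: np.ndarray, avg_reward: float,
                         action: int, reward: float, alpha: float) -> None:
    """Raise the chosen action's preference if the reward beats the baseline,
    lower it otherwise; move all other preferences the opposite way."""
    pi = softmax(preferences)
    baseline_gap = reward - avg_reward
    # A one-hot vector for the chosen action covers both update cases at once:
    # (1 - pi) for the selected action, (0 - pi) for the rest.
    one_hot = np.zeros_like(preferences)
    one_hot[action] = 1.0
    preferences += alpha * baseline_gap * (one_hot - pi)
```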