-greedy is a simple policy derived from values. Unlike a standard greedy policy that always takes the action that maximizes , -greedy also encourages exploration by occasionally choosing random actions.
Specifically, the policy is as follows.
is the hyperparameter that balances exploration and exploitation. Higher means more exploration, less exploitation. We usually gradually reduce during training since at the start, we want to explore more, but as our policy gets better, we want to exploit it more.