Attention mechanisms compute a weighted sum of values, paying different levels of attention to each one. Given a query $q$, keys $k_1, \dots, k_n$, and values $v_1, \dots, v_n$:
- Compute the score $e_i = a(q, k_i)$ for each key using some score function $a$. Intuitively, the higher the score, the more compatible the key is with the query.
- Compute the softmax over scores to get weights $\alpha_i = \frac{\exp e_i}{\sum_j \exp e_j}$.
- Output the weighted sum $c = \sum_i \alpha_i v_i$ of values.
In the simplest case, with the score function being a dot-product, we can write the attention mechanism as

$$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}(QK^\top)V,$$

where the rows of $Q$, $K$, and $V$ are the queries, keys, and values respectively.
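The three steps above can be sketched directly in NumPy for a single query (function and variable names here are illustrative, not from the original):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(q, K, V):
    """Dot-product attention for a single query.

    q: query vector of shape (d,)
    K: key matrix of shape (n, d), one key per row
    V: value matrix of shape (n, d_v), one value per row
    """
    scores = K @ q              # e_i = q . k_i  (dot-product score)
    weights = softmax(scores)   # alpha_i, nonnegative and summing to 1
    return weights @ V          # c = sum_i alpha_i * v_i

q = np.array([0.5, -0.2, 0.1])
K = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
V = np.array([[10.0, 0.0],
              [0.0, 10.0]])
out = attention(q, K, V)   # shape (2,): a convex combination of the rows of V
```

Because the weights come from a softmax, the output is always a convex combination of the values.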
However, there are many other score functions. The most common ones are below:
- Additive: $a(q, k) = w^\top \tanh(W[q; k])$, where $W$ and $w$ are learned weights and $[q; k]$ is the concatenated vector of $q$ and $k$.
- Dot-product: $a(q, k) = q^\top k$.
- General: $a(q, k) = q^\top W k$, where $W$ is learnable.
- Scaled dot-product: $a(q, k) = \frac{q^\top k}{\sqrt{d_k}}$, where $d_k$ is the dimension of the key space.
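Each score function maps a (query, key) pair to a scalar. A minimal sketch of all four, with randomly initialized stand-ins for the learned weights (the dimensions and names here are chosen for illustration):

```python
import numpy as np

d = 4  # key/query dimension, chosen arbitrarily for illustration
rng = np.random.default_rng(0)
W_add = rng.normal(size=(d, 2 * d))  # learned matrix for additive attention (random stand-in)
w_add = rng.normal(size=d)           # learned vector for additive attention
W_gen = rng.normal(size=(d, d))      # learnable matrix for the "general" score

def additive(q, k):
    # w^T tanh(W [q; k])
    return w_add @ np.tanh(W_add @ np.concatenate([q, k]))

def dot_product(q, k):
    # q^T k
    return q @ k

def general(q, k):
    # q^T W k
    return q @ W_gen @ k

def scaled_dot_product(q, k):
    # (q^T k) / sqrt(d_k)
    return (q @ k) / np.sqrt(k.shape[0])
```

Note that dot-product and scaled dot-product require queries and keys of equal dimension, while the additive and general forms can bridge different dimensions by shaping their weight matrices accordingly.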
Info
Note that additive attention was originally introduced for sequence-to-sequence translation, and the equation above abstracts away some subtleties of the original model.
Variants
Since the attention concept is extremely general, it can be applied to a variety of models (most notably 🦾 Transformers). In these integrations, there are a variety of names for specific attention configurations.
- Self-attention refers to generating the query, key, and values from the same source. Intuitively, one example is focusing on different parts of a single sentence as we scan through it.
- Cross-attention uses one source for the query and another source for the key and values. This is analogous to focusing on a sentence while we generate another one, like in translation.
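The two configurations use the same attention computation and differ only in where $Q$, $K$, and $V$ come from. A minimal sketch (real models would first apply learned projections to each source, which is omitted here for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention over whole matrices of queries/keys/values.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # one sequence: 5 tokens of dimension 8
y = rng.normal(size=(3, 8))   # another sequence: 3 tokens

# Self-attention: queries, keys, and values all come from the same source x.
self_out = attention(x, x, x)     # shape (5, 8)

# Cross-attention: queries from y, keys and values from x,
# as when generating a translation while attending to the source sentence.
cross_out = attention(y, x, x)    # shape (3, 8)
```

The output always has one row per query, while the keys and values determine what is attended over.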