Attention mechanisms compute a weighted sum of values, paying different levels of attention to each one. Given a query $q$ and key-value pairs $(k_1, v_1), \dots, (k_n, v_n)$, we compute the generalized attention for each query as follows.

  1. Compute the score $e_i = a(q, k_i)$ for each key using some score function $a$. Intuitively, the higher the score, the more compatible the key is with the query.
  2. Compute the softmax over the scores to get weights $\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}$.
  3. Output the weighted sum of values $\sum_i \alpha_i v_i$.
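
The three steps above can be sketched in a few lines of NumPy; the `attention` helper and the toy inputs here are illustrative, not part of any particular library:

```python
import numpy as np

def attention(q, K, V, score):
    # 1. Score each key against the query with the given score function.
    e = np.array([score(q, k) for k in K])
    # 2. Softmax over the scores to get the attention weights.
    w = np.exp(e - e.max())
    w = w / w.sum()
    # 3. Return the weighted sum of the values.
    return w @ V

# Dot-product score, the simplest choice.
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
out = attention(q, K, V, lambda q, k: q @ k)
```

Here the first key matches the query more closely, so the output lands nearer to the first value row.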

In the simplest case, with the score function being a dot product, we can write the attention mechanism as

$$\mathrm{Attention}(q, K, V) = \mathrm{softmax}(q K^\top)\, V,$$

where the rows of $K$ and $V$ are the keys $k_i$ and values $v_i$.

However, there are many other score functions; the most common are listed below:

  1. Additive: $a(q, k) = w^\top \tanh(W [q; k])$, where $W$ and $w$ are learned weights and $[q; k]$ is the concatenated vector of $q$ and $k$.
  2. Dot-product: $a(q, k) = q^\top k$.
  3. General: $a(q, k) = q^\top W k$, where $W$ is learnable.
  4. Scaled dot-product: $a(q, k) = \frac{q^\top k}{\sqrt{d_k}}$, where $d_k$ is the dimension of the key space.
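
The four score functions can be compared side by side on a single query-key pair; the variable names and random parameters below are illustrative stand-ins for the learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 4
q = rng.normal(size=d_k)
k = rng.normal(size=d_k)
W_g = rng.normal(size=(d_k, d_k))      # stand-in for the learnable matrix in "general"
W_a = rng.normal(size=(d_k, 2 * d_k))  # stand-in for the learnable matrix in "additive"
w_a = rng.normal(size=d_k)             # stand-in for the learnable vector in "additive"

additive = w_a @ np.tanh(W_a @ np.concatenate([q, k]))
dot = q @ k
general = q @ W_g @ k
scaled_dot = (q @ k) / np.sqrt(d_k)
```

Each expression produces a single scalar score, which is exactly what the softmax step expects.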

Info

Note that additive attention was originally used in sequence-to-sequence translation, and the equation above abstracts away some subtleties of the original model.

Variants

Since the attention concept is extremely general, it can be applied to a variety of models (most notably 🦾 Transformers). In these integrations, there are a variety of names for specific attention configurations.

  1. Self-attention generates the queries, keys, and values from the same source. Intuitively, this is like focusing on different parts of a single sentence as we scan through it.
  2. Cross-attention uses one source for the queries and another source for the keys and values. This is analogous to focusing on one sentence while we generate another, as in translation.
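
The distinction comes down to where $Q$, $K$, and $V$ come from; a minimal sketch, assuming a scaled dot-product score and random stand-in embeddings:

```python
import numpy as np

def attend(Q, K, V):
    # Scaled dot-product attention over whole sequences at once.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # "source" sentence: 5 tokens, dimension 8
Y = rng.normal(size=(3, 8))  # "target" sentence: 3 tokens, dimension 8

self_out = attend(X, X, X)   # self-attention: Q, K, V all come from X
cross_out = attend(Y, X, X)  # cross-attention: queries from Y, keys/values from X
```

Note the output always has one row per query: self-attention returns one vector per source token, while cross-attention returns one per target token.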