Sequence-to-sequence (Seq2Seq) models are used for tasks like translation, producing an output sequence from an input sequence. They use Recurrent Neural Networks, Long Short-Term Memory, or Gated Recurrent Units to encode the input sequence into a hidden state and then decode it into another sequence.

The core idea is that the encoder's output summarizes the entire sequence, and the decoder can use this summary to generate a response. This architecture is pictured below.
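
As a concrete illustration, here is a minimal encoder-decoder sketch in PyTorch using GRUs. The vocabulary, embedding, and hidden sizes (VOCAB, EMBED, HIDDEN) are arbitrary assumptions, and this is a sketch of the idea rather than a full training setup.

```python
import torch
import torch.nn as nn

VOCAB, EMBED, HIDDEN = 1000, 64, 128  # illustrative sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.rnn = nn.GRU(EMBED, HIDDEN, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len)
        outputs, hidden = self.rnn(self.embed(src))
        # hidden: (1, batch, HIDDEN) -- the summary of the whole input sequence
        return outputs, hidden

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.rnn = nn.GRU(EMBED, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, prev_token, hidden):         # prev_token: (batch, 1)
        output, hidden = self.rnn(self.embed(prev_token), hidden)
        return self.out(output), hidden            # logits over the vocabulary

# One decoding step conditioned only on the encoder's final hidden state.
enc, dec = Encoder(), Decoder()
src = torch.randint(0, VOCAB, (2, 7))              # a batch of 2 source sequences
_, summary = enc(src)
logits, _ = dec(torch.zeros(2, 1, dtype=torch.long), summary)
```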

Attention

One core problem with the standard Seq2Seq approach is that it struggles with long sequences, as information from earlier in the sequence tends to get lost. The Attention mechanism addresses this weakness by passing the encoder's intermediate hidden states to the decoder as well.

We add a step between the encoder and decoder that uses all hidden states from the encoder to figure out which hidden states are most relevant to each decoding time-step. In this application, the query is analogous to the previous decoder output, and the keys and values are both analogous to the encoder hidden states.

Then, instead of passing only the final encoder output to the decoder, we compute a weighted average of all encoder hidden states at each decoder time-step. At time-step $t$ of the decoding process, we perform the following steps (sketched in code after the list).

  1. For each encoder hidden state $h_i$ and the previous decoder output $s_{t-1}$, use a feed-forward neural network $a$ to compute the $i$th alignment score $e_{t,i} = a(s_{t-1}, h_i)$.
  2. Then, calculate a softmax over all scores to get the weights $\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_j \exp(e_{t,j})}$.
  3. Generate the context vector $c_t = \sum_i \alpha_{t,i} h_i$.
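
These three steps can be sketched as a single module. The version below assumes additive (feed-forward) scoring; the layer names W, U, v and the HIDDEN size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 128  # illustrative size

class AdditiveAttention(nn.Module):
    """Feed-forward alignment scores, softmax weights, and a context vector."""
    def __init__(self):
        super().__init__()
        self.W = nn.Linear(HIDDEN, HIDDEN)   # transforms the previous decoder output s_{t-1}
        self.U = nn.Linear(HIDDEN, HIDDEN)   # transforms each encoder hidden state h_i
        self.v = nn.Linear(HIDDEN, 1)        # projects to a scalar score e_{t,i}

    def forward(self, s_prev, enc_states):
        # s_prev: (batch, HIDDEN); enc_states: (batch, src_len, HIDDEN)
        scores = self.v(torch.tanh(self.W(s_prev).unsqueeze(1) + self.U(enc_states)))  # step 1
        weights = F.softmax(scores, dim=1)                    # step 2: alpha_{t,i}
        context = (weights * enc_states).sum(dim=1)           # step 3: c_t = sum_i alpha_{t,i} h_i
        return context, weights.squeeze(-1)
```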

The context vector $c_t$ and the previous decoder output are concatenated and given to the decoder to produce the next output.
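
One full decoder step might then look like the sketch below, which accepts any attention module (for example, the AdditiveAttention sketch above). Embedding the previous output token and using a GRUCell are illustrative assumptions, not the only possible design.

```python
import torch
import torch.nn as nn

VOCAB, EMBED, HIDDEN = 1000, 64, 128  # illustrative sizes

class AttnDecoderStep(nn.Module):
    """One decoding step: concatenate [previous output embedding; context] and update the state."""
    def __init__(self, attention):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.attention = attention
        self.rnn = nn.GRUCell(EMBED + HIDDEN, HIDDEN)   # input is the concatenation
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, prev_token, s_prev, enc_states):
        # prev_token: (batch,); s_prev: (batch, HIDDEN); enc_states: (batch, src_len, HIDDEN)
        context, _ = self.attention(s_prev, enc_states)        # weighted average of encoder states
        rnn_input = torch.cat([self.embed(prev_token), context], dim=-1)
        s_t = self.rnn(rnn_input, s_prev)                      # new decoder state
        return self.out(s_t), s_t                              # logits and state for the next step

# Usage (with the AdditiveAttention module sketched earlier):
# step = AttnDecoderStep(AdditiveAttention())
```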