Theory

LSTM improves on the standard Recurrent Neural Network by keeping two separate recurrence tracks for long-term and short-term memory. This mitigates the problem RNNs face of failing to retain information from early in the sequence.

The short-term information is passed along similarly to how RNNs maintain memory, whereas the long-term information, known as the cell state, is maintained separately. The flow of information is controlled through gates.

  1. Forget gate uses $h_{t-1}$ and $x_t$ to choose parts of $c_{t-1}$ to “forget,” or set to zero.
  2. Input gate again uses $h_{t-1}$ and $x_t$ to choose parts of the candidate update $\tilde{c}_t$ to write, forming $c_t$.
  3. Output gate uses $h_{t-1}$ and $x_t$ to select parts of an activated cell state, $\tanh(c_t)$, to output as $h_t$.
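
To make the three gates concrete, here is a minimal NumPy sketch of a single LSTM step, using the standard formulation with the gates computed from the concatenated $[h_{t-1}, x_t]$; the weight names, shapes, and the smoke test below are illustrative assumptions, not taken from this post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM time step: returns the new short-term (h_t) and long-term (c_t) states."""
    z = np.concatenate([h_prev, x_t])      # shared input to all gates
    f_t = sigmoid(z @ W_f + b_f)           # forget gate: what to erase from c_prev
    i_t = sigmoid(z @ W_i + b_i)           # input gate: what to write
    c_tilde = np.tanh(z @ W_c + b_c)       # candidate cell update
    c_t = f_t * c_prev + i_t * c_tilde     # new cell state (long-term track)
    o_t = sigmoid(z @ W_o + b_o)           # output gate: what to expose
    h_t = o_t * np.tanh(c_t)               # new hidden state (short-term track)
    return h_t, c_t

# Tiny smoke test with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
params = [rng.normal(size=(n_hid + n_in, n_hid)) if i % 2 == 0
          else np.zeros(n_hid) for i in range(8)]
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, *params)
print(h, c)
```

Note that both recurrence tracks thread through the step: $h_t$ is recomputed from scratch each step, while $c_t$ is only edited incrementally, which is what lets long-term information survive.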

Model

The model structure is depicted below.

Note that sigmoids (in red) are used for selection since they’re bounded from $0$ to $1$, and tanh (in blue) is used for activations. We “choose” parts of $c_{t-1}$ or $\tanh(c_t)$ by multiplying with a tensor consisting of values between $0$ and $1$.
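
A small numeric example (with arbitrarily chosen values) shows why this multiplication acts as selection: gate entries near $1$ pass a component through, and entries near $0$ suppress it.

```python
import numpy as np

cell_state = np.array([2.0, -1.5, 0.8])
gate = np.array([0.95, 0.02, 0.50])   # sigmoid outputs, each in (0, 1)
print(gate * cell_state)              # [ 1.9  -0.03  0.4 ]: first kept, second mostly forgotten
```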