Theory

Latent Dirichlet Allocation (LDA) is an unsupervised method for document classification that comes up with the topic classes on its own.

We treat each document as a mixture of topics, and each topic has a different probability distribution of words. The topic distribution for a document and the word distribution for a topic come from two Dirichlet distributions, $\text{Dir}(\alpha)$ and $\text{Dir}(\beta)$. The sketch below gives an example of draws from these two Dirichlet distributions.
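
A minimal NumPy sketch of one draw from each distribution; the topic count, vocabulary size, and parameter values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 8  # assumed: 3 topics, 8-word vocabulary

alpha = np.full(K, 0.5)  # hypothetical Dirichlet parameter over topics
beta = np.full(V, 0.1)   # hypothetical Dirichlet parameter over words

theta = rng.dirichlet(alpha)       # a document's topic distribution; sums to 1
phi = rng.dirichlet(beta, size=K)  # one word distribution per topic; rows sum to 1
```

Parameter values below 1 like these push draws toward the corners of the simplex, so each document concentrates on a few topics and each topic on a few words.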

Info

Each topic is similar to a 👶 Naive Bayes model, producing words with different probabilities.

Fundamentally, LDA follows a generative process. To create a document,

  1. Choose a topic distribution $\theta \sim \text{Dir}(\alpha)$ for our document.
  2. For each topic $k$, choose a word distribution $\varphi_k \sim \text{Dir}(\beta)$.
  3. For each of the $N$ word spots, choose a topic $z_n \sim \text{Multinomial}(\theta)$.
  4. Then, for each spot $n$ and its chosen topic $z_n$, choose a word from the topic's distribution, $w_n \sim \text{Multinomial}(\varphi_{z_n})$.
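
Chaining these four steps, a minimal sketch of the generative process; the sizes and hyperparameter values are assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, N = 3, 8, 20       # assumed: topics, vocabulary size, word spots
alpha = np.full(K, 0.5)  # hypothetical Dirichlet hyperparameters
beta = np.full(V, 0.1)

phi = rng.dirichlet(beta, size=K)   # 2. word distribution phi_k for each topic
theta = rng.dirichlet(alpha)        # 1. topic distribution theta for this document
z = rng.choice(K, size=N, p=theta)  # 3. topic z_n for each word spot
words = np.array([rng.choice(V, p=phi[k]) for k in z])  # 4. w_n from phi_{z_n}
```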

It can be optimized with 🎉 Expectation Maximization, alternating between calculating probabilities for the hidden variables $\theta$ and $z$, then estimating the parameters $\alpha$ and $\varphi$.

Model

Consists of parameters $\alpha$, which defines the Dirichlet distribution for topics, and $\varphi$, the multinomial distribution of words for each topic.

During training, we maintain hidden variables $\theta$, the multinomial distribution of topics, and $z$, the chosen topic from this distribution.
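
For concreteness, a hypothetical container for these four quantities (names and shapes are assumed, with $D$ documents, $K$ topics, and a $V$-word vocabulary):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LDAState:
    alpha: np.ndarray  # (K,)   Dirichlet parameter over topics
    phi: np.ndarray    # (K, V) word distribution per topic; rows sum to 1
    theta: np.ndarray  # (D, K) hidden topic distribution per document; rows sum to 1
    z: list            # z[d][n] = hidden topic chosen for word n of document d
```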

Training

Given training data, randomly assign each word in a document to a topic, then estimate $\theta$ by counting the topics that come up in each document.

Then, alternate until convergence, starting with the M step, then the E step.

  1. (M step) Given the topic assignments $z$, for each topic $k$, estimate its word distribution by counting: $\varphi_{k,w} = \frac{\#\{n : z_n = k,\ w_n = w\}}{\#\{n : z_n = k\}}$.
  2. (E step) Using $\theta$ and $\varphi$ from earlier, for every token estimate $p(z_n = k \mid w_n) \propto \theta_k \, \varphi_{k, w_n}$.

We can use this probability to pick new topics for each word, giving us updated $z$ and $\theta$.
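
Putting the procedure together, a sketch under the assumptions that documents arrive as arrays of integer word ids and that a tiny smoothing floor keeps $\varphi$ nonzero (the floor is an addition, not from the text):

```python
import numpy as np

def train_lda(docs, K, V, n_iters=50, seed=0):
    """docs: list of 1-D int arrays of word ids. Returns (theta, phi)."""
    rng = np.random.default_rng(seed)
    # Initialization: random topic for every word; theta from topic counts.
    z = [rng.choice(K, size=len(d)) for d in docs]
    theta = np.array([np.bincount(zd, minlength=K) / len(zd) for zd in z])
    for _ in range(n_iters):
        # M step: phi_{k,w} proportional to count of word w assigned to topic k.
        phi = np.full((K, V), 1e-9)  # tiny floor to avoid zero rows (assumption)
        for d, zd in zip(docs, z):
            np.add.at(phi, (zd, d), 1.0)
        phi /= phi.sum(axis=1, keepdims=True)
        # E step: p(z_n = k | w_n) proportional to theta_k * phi_{k, w_n};
        # sample new topics from it, then re-count each document's theta.
        for i, d in enumerate(docs):
            p = theta[i][:, None] * phi[:, d]  # (K, len(d))
            p /= p.sum(axis=0, keepdims=True)
            z[i] = np.array([rng.choice(K, p=p[:, n]) for n in range(len(d))])
            theta[i] = np.bincount(z[i], minlength=K) / len(d)
    return theta, phi
```

Starting with the M step matches the text: the random initial assignments play the role of the first E step's output.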

Prediction

To predict the topic for a given document, find the topic that maximizes the product of the probabilities of the document's words belonging to that topic, $\hat{k} = \arg\max_k \prod_n \varphi_{k, w_n}$.
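
A sketch of this rule, reusing the $\varphi$ estimated during training; summing log probabilities instead of multiplying raw ones is an added assumption for numerical stability:

```python
import numpy as np

def predict_topic(doc, phi):
    """doc: 1-D int array of word ids; phi: (K, V) topic-word distributions."""
    log_scores = np.log(phi[:, doc]).sum(axis=1)  # log of prod_n phi_{k, w_n} per topic
    return int(np.argmax(log_scores))             # topic with the highest product
```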