Theory

The Naive Bayes assumption states that each attribute $x_i$ is conditionally independent of the other attributes given the class $y$:

$$P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$$

Naive Bayes uses a generative model that describes how the training data is generated: to generate a document $x$, first pick a category $y$ according to $P(y)$, then generate the words of $x$ according to $P(x \mid y)$ using the Naive Bayes assumption.
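As a minimal sketch of this generative story (the classes, words, and probabilities below are made up for illustration):

```python
import random

# Hypothetical parameters: P(y) and P(w | y) for a toy spam/ham model.
p_class = {"spam": 0.3, "ham": 0.7}
p_word = {
    "spam": {"free": 0.6, "hello": 0.2, "win": 0.2},
    "ham":  {"free": 0.1, "hello": 0.7, "win": 0.2},
}

def generate(doc_length=5):
    # Pick a category y according to P(y).
    y = random.choices(list(p_class), weights=list(p_class.values()))[0]
    # Generate each word independently given y -- the Naive Bayes assumption.
    words = random.choices(list(p_word[y]), weights=list(p_word[y].values()), k=doc_length)
    return y, words

print(generate())
```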

Specifically, we have probabilities $P(y)$ for each class $y$, and probabilities $P(w \mid y)$ of each word $w$ occurring in a class $y$.

We can estimate these probabilities using frequencies in the data. To avoid probabilities of $0$ for unseen words, we apply Laplace smoothing, adding a prior of $1$ for each feature (analogous to assuming we've seen each feature once before).
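For example, with made-up counts and a vocabulary of size $4$, the smoothed estimates look like this (a word with count $0$ gets probability $1/10$ rather than $0$):

```python
# Laplace-smoothed estimate: P(w | y) = (count(w, y) + 1) / (total words in class y + |V|).
vocab_size = 4                                        # |V|
counts = {"free": 3, "hello": 0, "win": 2, "now": 1}  # made-up counts for one class
total = sum(counts.values())                          # 6

for w, c in counts.items():
    print(w, (c + 1) / (total + vocab_size))          # e.g. "hello" -> 1/10, not 0
```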

To predict the class of a given document $x$, we want to find the class $y$ that maximizes $P(y \mid x)$, which is found with 🪙 Bayes' Theorem, defined below.

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$$

Since $P(x)$ does not depend on $y$, it suffices to maximize $P(x \mid y)\,P(y)$.

Model

Our model consists of the probabilities $P(y)$ of each class $y$ and the probabilities $P(w \mid y)$ of each feature (word) $w$ given each class $y$, both estimated from the training data.

Training

Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ of $m$ labeled documents, calculate the vocabulary $V$, the set of all words appearing in the training documents.

For each label $y$,

$$P(y) = \frac{\text{count}(y)}{m}$$

where $\text{count}(y)$ is the number of training documents with label $y$.

For each word $w$ in the vocabulary,

$$P(w \mid y) = \frac{\text{count}(w, y) + 1}{\sum_{w' \in V} \text{count}(w', y) + |V|}$$

where $\text{count}(w, y)$ is the number of times $w$ appears in documents with label $y$.
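A minimal sketch of this training procedure, assuming documents are already tokenized into word lists (the function and variable names are illustrative, not from the original):

```python
from collections import Counter, defaultdict

def train(docs, labels):
    # docs: list of tokenized documents (lists of words); labels: parallel list of classes.
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)                 # word_counts[y][w] = count(w, y)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)

    prior = {y: c / len(docs) for y, c in class_counts.items()}      # P(y)
    likelihood = {}                                                  # Laplace-smoothed P(w | y)
    for y in class_counts:
        total = sum(word_counts[y].values())
        likelihood[y] = {w: (word_counts[y][w] + 1) / (total + len(vocab)) for w in vocab}
    return vocab, prior, likelihood
```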

Prediction

Given a document $x$ consisting of words $w_1, w_2, \dots, w_n$.

Return the class that maximizes the probability of generating $x$,

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(w_i \mid y)$$

During implementation, we calculate the following instead, to avoid underflow from multiplying many small probabilities:

$$\hat{y} = \arg\max_{y} \left( \log P(y) + \sum_{i=1}^{n} \log P(w_i \mid y) \right)$$
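A corresponding prediction sketch in log space, using the `train` function sketched above:

```python
import math

def predict(doc, vocab, prior, likelihood):
    # Return the class y maximizing log P(y) + sum_i log P(w_i | y); summing logs avoids underflow.
    best_class, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for w in doc:
            if w in vocab:                   # words never seen in training are ignored
                score += math.log(likelihood[y][w])
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Toy usage with the train() sketch above.
docs = [["free", "win", "now"], ["hello", "now"], ["win", "free", "free"]]
labels = ["spam", "ham", "spam"]
vocab, prior, likelihood = train(docs, labels)
print(predict(["free", "hello"], vocab, prior, likelihood))
```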

Info

For document classification, in the prediction step we assume that there is no information in the words we did not observe: a standard probabilistic model would include, for each word in the vocabulary, the probability of that word not being observed, instead of only the probabilities of the words that do appear in the input document.
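Written out, the likelihood used above only multiplies over words that appear in $x$, whereas a Bernoulli-style model (a standard alternative, not part of the model above) multiplies over the whole vocabulary, including a factor for each absent word:

$$P(x \mid y) = \prod_{i=1}^{n} P(w_i \mid y)$$

$$P(x \mid y) = \prod_{w \in V} P(w \mid y)^{\mathbb{1}[w \in x]} \, \bigl(1 - P(w \mid y)\bigr)^{\mathbb{1}[w \notin x]}$$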