General 🎲 Probability Distributions require an exponential number of parameters to describe. For example, for two binary variables $x_1$ and $x_2$, we need to define the probabilities of all $2^2 = 4$ joint assignments (except the last one, which is fixed by normalization); in general, $n$ binary variables require $2^n - 1$ parameters.

To model complex distributions, we make assumptions that simplify the distribution, essentially enforcing certain conditional independence conditions to reduce the number of parameters needed to represent the distribution. To do this, we can model probabilities using a graph, where each random variable is a node. Due to this structure, these models are also called probabilistic graphical models (PGMs).
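
For intuition, here is a minimal sketch (the chain structure $x_1 \to x_2 \to \dots \to x_n$ is an assumed example, not from the source) comparing the parameter count of a full joint table with that of a chain-factorized model:

```python
# Free parameters needed for n binary variables (illustrative sketch).

def full_joint_params(n: int) -> int:
    # A full joint table has 2^n entries; one is fixed by normalization,
    # leaving 2^n - 1 free parameters.
    return 2**n - 1

def chain_params(n: int) -> int:
    # A chain-structured model x1 -> x2 -> ... -> xn needs 1 parameter
    # for p(x1) and 2 for each conditional p(x_i | x_{i-1}).
    return 1 + 2 * (n - 1)

for n in (2, 5, 10, 20):
    print(n, full_joint_params(n), chain_params(n))
```

At $n = 20$ this is roughly a million parameters for the full joint versus 39 for the chain, which is the whole point of imposing structure.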

  1. 🚨 Bayesian Networks naturally model causation and conditional distributions using directed edges.
  2. 🗳️ Markov Random Fields and 🌾 Conditional Random Fields model more general relationships between random variables using undirected edges.

Note

Bayesian networks can be expressed with a product of factors similar to MRFs. Specifically, the factors are the conditional probability distributions: each node $x_i$ forms a factor with its parents $\mathrm{pa}(x_i)$.
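
Written out (a standard identity, with $\mathrm{pa}(x_i)$ denoting the parents of $x_i$):

$$p(x_1, \dots, x_n) = \prod_i p\big(x_i \mid \mathrm{pa}(x_i)\big) = \prod_i \phi_i\big(x_i, \mathrm{pa}(x_i)\big), \qquad Z = 1,$$

where the partition function is trivially 1 because each conditional factor is already normalized.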

Intractability

However, inference and computing the partition function (a normalizing constant for the distribution) are often intractable. We must then solve them heuristically or approximate them using a variety of methods below.

Intractable Inference

Inference is the task of computing probabilities or expectations from a model, like the marginal $p(x)$ or the conditional $p(x \mid e)$ given evidence $e$; some problems may also look for other properties like the MAP assignment $\arg\max_x p(x \mid e)$.

With complex models, inference computations require exponential-time calculations or involve unknown values. For example, given a model with latent variables $z$, our likelihood calculation for a specific $x$ can be found by marginalizing over $z$, $p(x) = \int p(x, z)\, dz$, or with the chain rule, $p(x) = \frac{p(x, z)}{p(z \mid x)}$.

Both methods pose a challenge: the former requires us to integrate over all latent variables, which is often intractable, and the latter requires knowing the ground-truth latents $z$ for $x$ (equivalently, the posterior $p(z \mid x)$).
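
As a toy illustration (a sketch with assumed numbers, not from the source), marginalizing a single two-valued latent is easy; the trouble starts once $z$ is high-dimensional or continuous and the sum turns into an intractable integral:

```python
import numpy as np
from scipy.stats import norm

# Assumed toy latent-variable model: z ~ Categorical(pi), x | z ~ N(mu_z, 1).
pi = np.array([0.3, 0.7])    # prior p(z)
mu = np.array([-2.0, 2.0])   # component means

def likelihood(x: float) -> float:
    # p(x) = sum_z p(z) p(x | z): tractable only because z takes two values.
    return float(np.sum(pi * norm.pdf(x, loc=mu, scale=1.0)))

print(likelihood(1.5))
```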

To get around this intractability, we use different ways of computing or approximating the answers to inference questions.

  1. 🍿 Kernel Density Estimation estimates $p(x)$ purely from drawn samples (see the sketch after this list).
  2. 🔫 Variable Elimination performs exact inference by optimizing the computation order and strategy.
  3. 🍕 Belief Propagation and 🌲 Junction Trees simulate message-passing to infer true marginals or MAP arguments.
  4. 🧬 Evidence Lower Bound lower bounds the evidence $\log p(x)$, turning inference into an optimization problem for tightening the bound.
  5. 😡 Mean Field Approximation uses the ELBO to optimize a specific variational distribution via coordinate ascent.
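
To make the first item concrete, here is a minimal Gaussian kernel density estimate of $p(x)$ from samples (a sketch; the bandwidth and the sampling distribution are assumed for illustration):

```python
import numpy as np

def kde(samples: np.ndarray, x: float, bandwidth: float = 0.5) -> float:
    # Gaussian KDE: average a kernel centered at each drawn sample,
    # p_hat(x) = (1 / (n * h)) * sum_i K((x - x_i) / h).
    diffs = (x - samples) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return float(np.mean(kernels) / bandwidth)

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=2.0, size=1000)  # draws from the true p(x)
print(kde(samples, x=1.0))  # density estimate at x = 1.0
```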

Intractable Partition

Many probabilistic models, especially ones using undirected edges, are defined by an unnormalized probability distribution $\tilde{p}(x; \theta)$. Their distribution needs to be normalized,

$$p(x; \theta) = \frac{1}{Z(\theta)} \tilde{p}(x; \theta),$$

by the partition function $Z(\theta)$, defined as follows (with a sum instead of an integral for discrete $x$):

$$Z(\theta) = \int \tilde{p}(x; \theta)\, dx.$$

If $Z(\theta)$ is intractable, we need to optimize the log likelihood $\log p(x; \theta) = \log \tilde{p}(x; \theta) - \log Z(\theta)$ using some approximate methods.
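
For intuition, here is what the brute-force computation of $Z(\theta)$ looks like for a tiny binary pairwise model (a sketch; the energy $x^\top W x$ is an assumed toy example); the sum over all $2^n$ configurations is exactly what becomes intractable as $n$ grows:

```python
import itertools
import numpy as np

# Assumed toy model: p~(x; W) = exp(x^T W x) over binary x in {0, 1}^n.
n = 4
rng = np.random.default_rng(0)
W = rng.normal(size=(n, n))
W = (W + W.T) / 2  # symmetric pairwise weights

def unnormalized(x: np.ndarray) -> float:
    return float(np.exp(x @ W @ x))

# Z(W) sums p~(x; W) over all 2^n configurations -- exponential in n.
Z = sum(unnormalized(np.array(x)) for x in itertools.product((0, 1), repeat=n))
print(Z)
```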

  1. 🖖 Contrastive Divergence estimates the gradient $\nabla_\theta \log p(x; \theta)$, which requires the gradient of $\log Z(\theta)$, using 🎯 Markov Chain Monte Carlo.
  2. 🙃 Pseudo-Likelihood maximizes a proxy objective that's computable with the unnormalized $\tilde{p}$ alone (see the sketch after this list).
  3. 🎼 Score Matching optimizes the derivative of the log distribution with respect to $x$, which allows us to avoid the gradient of the partition function.
  4. 📣 Noise Contrastive Estimation trains a classifier on a joint model-noise distribution; optimizing this supervised problem gives us optimal weights and the partition function for our original model.
  5. 🥃 Annealed Importance Sampling approximates the ratio between two partition functions; with a tractable second partition, this allows us to directly approximate $Z(\theta)$.
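
To ground the second item, here is a pseudo-likelihood sketch for the same kind of binary pairwise model as above (the energy $x^\top W x$ is again an assumed toy example): each conditional $p(x_i \mid x_{-i})$ only needs a normalization over the two states of $x_i$, so the global $Z(\theta)$ never appears.

```python
import numpy as np

def log_unnormalized(x: np.ndarray, W: np.ndarray) -> float:
    # log p~(x; W) = x^T W x for binary x (assumed toy energy).
    return float(x @ W @ x)

def pseudo_log_likelihood(x: np.ndarray, W: np.ndarray) -> float:
    # sum_i log p(x_i | x_{-i}); each conditional normalizes over just the
    # two states of x_i, so the global partition function is never needed.
    total = 0.0
    for i in range(len(x)):
        log_probs = []
        for v in (0, 1):
            x_v = x.copy()
            x_v[i] = v
            log_probs.append(log_unnormalized(x_v, W))
        total += float(log_unnormalized(x, W) - np.logaddexp(*log_probs))
    return total

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W = (W + W.T) / 2
x = np.array([1, 0, 1, 1])
print(pseudo_log_likelihood(x, W))  # maximize this over W in place of log p
```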