General Probability Distributions require an exponential number of parameters to describe. For example, for two binary variables $x_1$ and $x_2$, the full joint $p(x_1, x_2)$ already needs $2^2 - 1 = 3$ parameters, and $n$ binary variables need $2^n - 1$.
To model complex distributions, we make simplifying assumptions, essentially forcing certain independence conditions to reduce the number of parameters needed to represent the distribution. To do this, we can model probabilities using a graph, where each random variable is a node and the edges encode the dependencies between variables.
- Bayesian Networks naturally model causation and conditional distributions using directed edges.
- Markov Random Fields and Conditional Random Fields model more general relationships between random variables using undirected edges.
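To make the parameter savings concrete, here is a minimal sketch, assuming a chain-structured Bayesian network over binary variables (the chain structure, sizes, and function names are illustrative, not from the source), that compares the size of a full joint table with the size of the chain's conditional probability tables.

```python
# Minimal sketch: parameter counts for a full joint table vs. a
# chain-structured Bayesian network x1 -> x2 -> ... -> xn over binary variables.
# (The chain structure is an illustrative assumption.)

def full_joint_params(n: int) -> int:
    # A full joint table over n binary variables has 2**n entries,
    # minus 1 because the probabilities must sum to 1.
    return 2 ** n - 1

def chain_bn_params(n: int) -> int:
    # p(x1) needs 1 parameter; each p(x_i | x_{i-1}) needs 2
    # (one Bernoulli parameter per parent configuration).
    return 1 + 2 * (n - 1)

for n in (2, 10, 30):
    print(n, full_joint_params(n), chain_bn_params(n))
# n = 30: 1,073,741,823 parameters for the full joint vs. 59 for the chain.
```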
Note
Bayesian networks can be expressed with a product of factors, $p(x_1, \dots, x_n) = \prod_i \phi_i$, similar to MRFs. Specifically, the factors are the conditional probability distributions: each node $x_i$ forms a factor with its parents, $\phi_i = p(x_i \mid \mathrm{pa}(x_i))$.
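As an illustrative instance (the three-node chain is an assumption for concreteness), for a chain $a \to b \to c$ the factors are exactly the conditional probability tables:

$$
p(a, b, c) = \underbrace{p(a)}_{\phi_1(a)} \; \underbrace{p(b \mid a)}_{\phi_2(a, b)} \; \underbrace{p(c \mid b)}_{\phi_3(b, c)} .
$$

Because each factor is already a normalized conditional, the product sums to one over all assignments, so a Bayesian network's partition function is always $1$, unlike a general MRF's.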
Intractability
However, inference and computing the partition function (the normalizing constant of the distribution) are often intractable. We must then solve these problems heuristically or approximate them using the variety of methods below.
Intractable Inference
Inference is the task of computing probabilities or expectations from a model, such as a marginal $p(x)$ or a conditional $p(x \mid e)$ given evidence $e$.
With complex models, inference computations require exponential-time calculations or involve unknown values. For example, given a model with latent variables $z$, evaluating the likelihood of data $x$ means either marginalizing, $p(x) = \int p(x, z)\,dz$, or evaluating the joint $p(x, z)$ at the true latents. Both approaches pose a challenge: the former requires us to integrate over all latent variables, which is often intractable, and the latter requires knowing the ground-truth latents $z$, which are never observed.
To get around this otherwise intractable process, we use a variety of methods to compute or approximate the answers to these inference queries.
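To see why the marginalization blows up, here is a minimal sketch under assumed, made-up model details (the quadratic `log_joint`, the sizes, and the variable names are all illustrative): it computes $\tilde{p}(x) = \sum_z \tilde{p}(x, z)$ by brute force, looping over all $2^m$ configurations of $m$ binary latents.

```python
import itertools
import numpy as np

# Toy unnormalized joint over an observed binary vector x and binary latents z.
# The specific form of log_joint is an illustrative assumption.
rng = np.random.default_rng(0)
m = 12                       # number of binary latent variables
W = rng.normal(size=(3, m))  # couples 3 observed bits to the latents

def log_joint(x, z):
    # log p~(x, z): a simple quadratic coupling between x and z.
    return float(x @ W @ z - 0.5 * z.sum())

def marginal(x):
    # p~(x) = sum_z p~(x, z): brute force over all 2**m latent configurations.
    logs = np.array([log_joint(x, np.array(z))
                     for z in itertools.product([0, 1], repeat=m)])
    # log-sum-exp trick for numerical stability
    return np.exp(logs.max()) * np.exp(logs - logs.max()).sum()

x = np.array([1, 0, 1])
print(marginal(x))  # 2**12 = 4096 terms here; 2**100 latents would be hopeless.
```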
- Kernel Density Estimation estimates $p(x)$ purely from drawn samples (a minimal sketch appears after this list).
- Variable Elimination performs exact inference by optimizing the computation order and strategy.
- Belief Propagation and Junction Trees use message passing to infer true marginals or MAP assignments.
- Evidence Lower Bound lower bounds the evidence $\log p(x)$, turning inference into an optimization problem of tightening the bound.
- Mean Field Approximation uses the ELBO to optimize a specific variational distribution via coordinate ascent.
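Below is a minimal kernel density estimation sketch, referenced in the first bullet above; the Gaussian kernel, the bandwidth value, and the toy bimodal data are assumptions for illustration.

```python
import numpy as np

def kde(query, samples, bandwidth=0.3):
    """Gaussian kernel density estimate of p(x) at the query points.

    p_hat(x) = (1 / (n * h)) * sum_i K((x - x_i) / h) with a standard
    normal kernel K; the bandwidth h is an assumed hyperparameter.
    """
    u = (query - samples[:, None]) / bandwidth            # shape (n, len(query))
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # standard normal pdf
    return kernel.mean(axis=0) / bandwidth

# Usage: recover a bimodal density purely from drawn samples.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(1, 0.7, 500)])
grid = np.linspace(-4, 3, 5)
print(kde(grid, samples))  # estimated density at a few query points
```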
Intractable Partition
Many probabilistic models, especially ones using undirected edges, are defined by an unnormalized probability distribution $\tilde{p}(x)$, normalized by the partition function $Z$:

$$
p(x) = \frac{1}{Z}\,\tilde{p}(x), \qquad Z = \sum_x \tilde{p}(x) \quad \text{(an integral for continuous $x$)}.
$$

If $Z$ sums over exponentially many configurations, computing it, and hence the normalized likelihood, is intractable, so we estimate or sidestep it with the methods below.
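Writing the model as $p(x; \theta) = \tilde{p}(x; \theta) / Z(\theta)$ makes the difficulty explicit; the log-likelihood gradient splits into a tractable unnormalized term and a partition term,

$$
\nabla_\theta \log p(x; \theta) = \nabla_\theta \log \tilde{p}(x; \theta) - \nabla_\theta \log Z(\theta),
\qquad
\nabla_\theta \log Z(\theta) = \mathbb{E}_{x' \sim p(x'; \theta)}\!\left[ \nabla_\theta \log \tilde{p}(x'; \theta) \right],
$$

and the methods below differ mainly in how they estimate or avoid that second term.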
- Contrastive Divergence estimates the gradient $\nabla_\theta \log p(x)$, which requires the gradient of $\log Z$, using Markov Chain Monte Carlo (a minimal sketch appears after this list).
- Pseudo-Likelihood maximizes a proxy objective that's computable with only the unnormalized $\tilde{p}$.
- Score Matching optimizes the derivative of the distribution with respect to $x$, which allows us to avoid the gradient of the partition.
- Noise Contrastive Estimation trains a classifier on a joint model-noise distribution; optimizing this supervised problem gives us optimal weights and the partition function for our original model.
- Annealed Importance Sampling approximates the ratio between two partitions; with a tractable second partition, this allows us to directly approximate $Z$.
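As one concrete example, here is a minimal contrastive divergence (CD-1) sketch for a small Bernoulli restricted Boltzmann machine; the RBM choice, the sizes, the learning rate, and the toy data are illustrative assumptions, not prescribed above. The intractable $\nabla_\theta \log Z$ expectation is approximated with samples from one step of Gibbs sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Bernoulli RBM: p~(v, h) = exp(v @ W @ h + b @ v + c @ h).
# Sizes and hyperparameters are illustrative assumptions.
n_vis, n_hid, lr = 6, 4, 0.05
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b = np.zeros(n_vis)   # visible bias
c = np.zeros(n_hid)   # hidden bias

def cd1_update(v0):
    """One CD-1 gradient estimate from a batch of visible vectors v0 (batch x n_vis)."""
    # Positive phase: hidden statistics under the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step approximates samples from the model,
    # standing in for the intractable expectation that defines grad log Z.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Approximate log-likelihood gradient: data statistics minus model statistics.
    grad_W = (v0.T @ ph0 - v1.T @ ph1) / len(v0)
    grad_b = (v0 - v1).mean(axis=0)
    grad_c = (ph0 - ph1).mean(axis=0)
    return grad_W, grad_b, grad_c

# Usage: fit a toy dataset of two repeating binary patterns.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 32, dtype=float)
for _ in range(200):
    gW, gb, gc = cd1_update(data)
    W += lr * gW
    b += lr * gb
    c += lr * gc
print(sigmoid(data[:2] @ W + c))  # hidden activations for the two patterns
```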