Markov Chain Monte Carlo (MCMC) methods perform approximate sampling from an intractable distribution $p(x)$. Their properties are guaranteed for distributions with $p(x) > 0$ for all $x$, so they're commonly applied to ⚡️ Energy-Based Models, which guarantee this inequality via the Boltzmann distribution $p(x) \propto \exp(-E(x))$.

The core idea is to repeatedly update a state $x$ so that after enough iterations, $x$ becomes an approximately fair sample from $p(x)$. Formally, we have a random state $x$ that's arbitrarily initialized and a transition distribution $T(x' \mid x)$ that specifies how $x$ changes in every iteration; $x$ has the form of a ⛓️ Markov Chain, thus giving this algorithm its name.

Consider infinitely many states initialized and transitioned in parallel. At the beginning, these states have distribution $q^{(0)}(x)$ (whatever distribution we chose for the arbitrary initialization), and after $t$ steps, they form a distribution $q^{(t)}(x)$. Our goal is to make $q^{(t)}$ converge to $p$ via the transitions

$$q^{(t)}(x') = \sum_x q^{(t-1)}(x)\, T(x' \mid x).$$
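To make this concrete, here's a minimal sketch of many chains run in parallel. The two-state chain, its flip probabilities, and the `init_sampler`/`transition` helpers are made-up stand-ins for a real $T(x' \mid x)$, not part of the original notes:

```python
import numpy as np

def init_sampler(n):
    # q^(0): arbitrarily start every chain in state 0
    return np.zeros(n, dtype=int)

def transition(x):
    # A made-up T(x' | x) over two states: flip 0 -> 1 with probability 0.3
    # and 1 -> 0 with probability 0.1; its equilibrium is p = (0.25, 0.75)
    flip_prob = np.where(x == 0, 0.3, 0.1)
    return np.where(np.random.rand(x.size) < flip_prob, 1 - x, x)

x = init_sampler(10_000)       # 10,000 states drawn from q^(0)
for _ in range(100):
    x = transition(x)          # one MCMC step applied to every chain
print(x.mean())                # ~0.75: the empirical q^(t) has reached p
```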

Our choice of $T$ varies with each MCMC algorithm; the most common solutions are ♻️ Gibbs Sampling and 🚊 Metropolis-Hastings.

Convergence

In order to converge to an equilibrium distribution, we need to satisfy two constraints:

  1. Irreducibility: it's possible to get from any state to any other state with non-zero probability in a finite number of steps.
  2. Aperiodicity: it's possible to return to any state at any time.

Irreducibility guarantees the absence of absorbing states that prevent the chain from leaving them, and aperiodicity prevents the chain from alternating back and forth periodically.
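As a quick illustration (a toy example, not from the original notes), a two-state chain that always swaps states violates aperiodicity and never converges, while adding self-loops fixes it:

```python
import numpy as np

# Periodic chain: the two states always swap, so q^(t) oscillates forever
A_periodic = np.array([[0.0, 1.0],
                       [1.0, 0.0]])
# Irreducible and aperiodic chain: self-loops let the chain return at any time
A_mixing = np.array([[0.9, 0.1],
                     [0.1, 0.9]])

for A in (A_periodic, A_mixing):
    q = np.array([1.0, 0.0])   # q^(0): all mass on state 0
    for _ in range(51):
        q = A @ q              # one transition of the distribution
    print(q)  # periodic: [0, 1] after an odd step count; mixing: ~[0.5, 0.5]
```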

Theory

The convergence of the random variable's distribution may not be intuitively obvious. The following theory explains it using the principal eigenvector of a corresponding transition matrix.

Consider a discrete variable $x$. Let $v$ describe the distribution $q$ such that $v_i = q(x = i)$. Our above transition can then be described in terms of $v$ and a transition matrix $A$,

$$v^{(t)} = A v^{(t-1)},$$

where $A_{ij} = T(x' = i \mid x = j)$.

Since the columns of $A$ each represent a probability distribution, $A$ is called a stochastic matrix. By the Perron–Frobenius theorem, if all transitions have non-zero probability, the largest eigenvalue of $A$ is real and equal to $1$. As we multiply our $v^{(0)}$ by $A$ over and over for $t$ steps, the eigenvalues of $A$ are exponentiated as follows:

$$v^{(t)} = A^t v^{(0)} = \left(U \operatorname{diag}(\lambda) U^{-1}\right)^t v^{(0)} = U \operatorname{diag}(\lambda)^t U^{-1} v^{(0)}.$$

Thus, all eigenvalues less than $1$ decay to zero, and under some mild conditions, $A$ has only one eigenvector with eigenvalue $1$. So with high enough $t$, we reach the equilibrium distribution where

$$v = A v,$$

with $v$ being the eigenvector with the largest eigenvalue, $1$.
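We can check this numerically. The sketch below (assuming a random column-stochastic $A$ with all entries positive, which is not from the original notes) applies repeated transitions and compares the result to the eigenvector with eigenvalue $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4))
A /= A.sum(axis=0, keepdims=True)    # normalize columns: column-stochastic

v = np.array([1.0, 0.0, 0.0, 0.0])   # arbitrary initial distribution q^(0)
for _ in range(100):
    v = A @ v                        # v^(t) = A^t v^(0)

# The eigenvector of A with the largest (unit) eigenvalue, rescaled to sum to 1
eigvals, eigvecs = np.linalg.eig(A)
v_eq = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
v_eq /= v_eq.sum()

print(np.allclose(v, v_eq))          # True: v has converged to the equilibrium
```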

Generalizing this equilibrium to continuous variables, the above equation is theoretically equivalent to the following:

$$q'(x') = \mathbb{E}_{x \sim q}\, T(x' \mid x).$$

Detailed Balance Condition

The above explanation shows that MCMC will converge, but it doesn't tell us what it converges to. The following framework relates our transitions to the equilibrium distribution.

Let $x$ and $x'$ be two distinct states. The detailed balance condition states that if

$$p(x)\, T(x' \mid x) = p(x')\, T(x \mid x')$$

is satisfied, we will converge to an approximation of $p(x)$. Thus, our choice of $T$ can be shown to be valid for MCMC. This condition can be applied to ♻️ Gibbs Sampling and 🚊 Metropolis-Hastings to show that their choices of transition allow us to approximate our target.
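As a sanity check (a toy construction, not from the original notes), we can build a Metropolis-style transition matrix over a small made-up discrete target and verify both detailed balance and stationarity numerically:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])        # hypothetical target distribution
n = len(p)

# Metropolis-style T: propose another state uniformly, accept with min(1, p'/p)
T = np.zeros((n, n))
for j in range(n):                   # column j = current state
    for i in range(n):               # row i = proposed next state
        if i != j:
            T[i, j] = (1 / (n - 1)) * min(1.0, p[i] / p[j])
    T[j, j] = 1.0 - T[:, j].sum()    # rejected proposals stay at j

balance = p[None, :] * T             # balance[i, j] = p(x=j) T(x'=i | x=j)
print(np.allclose(balance, balance.T))  # True: detailed balance holds
print(np.allclose(T @ p, p))            # True: p is the equilibrium of T
```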

Sampling

Once the Markov chain reaches the equilibrium distribution (getting there is called "burning in" the chain, and the time it takes is the "mixing time"), we can sample indefinitely from the equilibrium distribution with

$$x' \sim T(x' \mid x).$$

Note that this is the same formula as the transition that changes the distribution of $x$ with $T$, but since we're in equilibrium, the distribution of $x$ no longer changes between steps.

Note

There's no analytical way to check when we've burned in the chain. Most methods rely on heuristics, so the exact point where we start collecting approximate samples after burn-in is arbitrary.

However, consecutive samples are identically distributed but not independent, so we often keep only every $n$-th sample (for some thinning interval $n$) to reduce the correlation between samples. Another solution is to run multiple MCMC chains in parallel to break up the sequential dependence.
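A minimal sketch of this sampling recipe, assuming a hypothetical `transition` callable and made-up burn-in and thinning settings:

```python
import numpy as np

def sample_chain(transition, x0, n_samples, burn_in=1_000, thin=10):
    x = x0
    for _ in range(burn_in):         # heuristic burn-in: let the chain mix
        x = transition(x)
    samples = []
    for _ in range(n_samples):
        for _ in range(thin):        # thinning: keep only every `thin`-th state
            x = transition(x)
        samples.append(x)
    return np.array(samples)
```

Calling `sample_chain` several times from different starting states `x0` gives the parallel-chain variant.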

Modes

A single chain generally hovers around a low-energy region (a mode of the distribution) and rarely leaves it. As such, it's difficult for MCMC samples to cover multiple modes of a complex distribution.

One solution is to modify our distribution to have higher entropy (less extreme peaks and valleys). In ⚡️ Energy-Based Models, we can scale the term in the exponent by a temperature parameter $\beta$, giving $p(x) \propto \exp(-\beta E(x))$; with $\beta < 1$ (called tempering), we allow for easier mixing between modes.
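A tiny numeric illustration (with a made-up three-state energy function) of how a smaller exponent scale spreads probability mass across modes:

```python
import numpy as np

def boltzmann(E, beta=1.0):
    # p(x) proportional to exp(-beta * E(x)); beta < 1 flattens the landscape
    w = np.exp(-beta * E)
    return w / w.sum()

E = np.array([0.0, 4.0, 0.5])        # two low-energy modes, one high barrier
print(boltzmann(E, beta=1.0))        # sharp: the barrier state is rarely visited
print(boltzmann(E, beta=0.2))        # tempered: easier to cross between modes
```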