NeRF (Neural Radiance Field) creates a 3D scene from multiple images taken from different perspectives. In other words, given a bunch of images of an object, the network learns the color and density of the object at each coordinate in 3D space; density controls how much light passes through a particular coordinate (to model glass, for example).

The model, implemented as an MLP $F_\Theta$, generates the scene by outputting color and density for any given coordinate and viewing direction. Formally, we have

$$F_\Theta : (\gamma(\mathbf{x}), \gamma(\mathbf{d})) \rightarrow (\mathbf{c}, \sigma)$$

for color $\mathbf{c} = (r, g, b)$, density $\sigma$, 3D coordinate $\mathbf{x} = (x, y, z)$, and viewing direction $\mathbf{d} = (\theta, \phi)$; $\gamma$ is a positional encoding

$$\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right)$$

that projects its input into higher dimensions.
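As a concrete illustration, here is a minimal NumPy sketch of such a positional encoding; the function name, the default of 10 frequency bands, and the exact feature layout are my own choices for the sketch, not taken from the paper's code.

```python
import numpy as np

def positional_encoding(p, num_freqs=10):
    """Map each component of p to sines and cosines at exponentially spaced
    frequencies, projecting it into a higher-dimensional space."""
    p = np.asarray(p, dtype=np.float64)            # shape (..., D)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi  # 2^0*pi, 2^1*pi, ..., 2^(L-1)*pi
    angles = p[..., None] * freqs                  # shape (..., D, num_freqs)
    features = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return features.reshape(*p.shape[:-1], -1)     # shape (..., D * 2 * num_freqs)

# A 3D coordinate becomes a 60-dimensional feature vector with 10 bands.
print(positional_encoding(np.array([0.1, -0.4, 0.7])).shape)  # (60,)
```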

Volume Rendering

We can render an image of the scene using volume rendering, which finds the color of every pixel by projecting a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, then taking the integral

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\, \sigma(\mathbf{r}(t))\, \mathbf{c}(\mathbf{r}(t), \mathbf{d})\, dt$$

where $T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\, ds\right)$ computes how much light makes it to $t$ along the ray and $\mathbf{d}$ is the direction (angles $\theta$ and $\phi$) of the ray. In implementation, the integral is computed numerically using quadrature.

This essentially simulates light traveling along the ray and finds the expected color based on the probability of light reaching each position on the ray.
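As a sketch of what that quadrature looks like in practice, the code below integrates along a single ray given per-sample densities and colors already queried from the network; the function and argument names are illustrative, not the authors' implementation.

```python
import numpy as np

def render_ray(sigmas, colors, ts):
    """Numerically integrate the rendering equation along one ray.

    sigmas: (N,) densities at the N sample depths
    colors: (N, 3) RGB predictions at the N sample depths
    ts:     (N,) sample depths along the ray
    """
    deltas = np.append(np.diff(ts), 1e10)           # spacing between consecutive samples
    alphas = 1.0 - np.exp(-sigmas * deltas)         # opacity contributed by each segment
    # T_i: probability that light travels from the camera to sample i unoccluded.
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))
    weights = trans * alphas                        # per-sample contribution to the pixel
    return (weights[:, None] * colors).sum(axis=0)  # expected color of the pixel
```

The resulting `weights` are the per-sample probabilities described above; they also serve as the density profile that the coarse network hands to the fine sampling stage described next.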

Coarse and Fine

The complete system actually uses two networks, one coarse and one fine: the former gives a rough idea of the density along each ray, and the latter uses this information to sample areas that have the highest density. In doing so, we allocate more samples to regions that are more likely to be visible in the final result.
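Here is a rough sketch of that second sampling pass, assuming the coarse pass has produced per-sample weights like those computed in `render_ray` above; it draws new depths by inverse-transform sampling the piecewise-constant distribution defined by those weights (names and details are mine, not the paper's).

```python
import numpy as np

def sample_fine(ts_coarse, weights, n_fine, rng=None):
    """Draw extra sample depths where the coarse weights are largest."""
    rng = rng if rng is not None else np.random.default_rng()
    pdf = weights + 1e-5                         # avoid zero-probability bins
    pdf = pdf / pdf.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.uniform(size=n_fine)                 # uniform draws in [0, 1)
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(weights) - 1)
    # Linearly interpolate within the chosen bin to get a depth along the ray.
    bins = np.append(ts_coarse, ts_coarse[-1])
    frac = (u - cdf[idx]) / np.maximum(cdf[idx + 1] - cdf[idx], 1e-10)
    return np.sort(bins[idx] + frac * (bins[idx + 1] - bins[idx]))
```

The fine network is then evaluated at the union of the coarse and fine depths, so regions with high density receive the majority of the samples.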

Optimization

To optimize $F_\Theta$, we overfit to a particular scene, and the final network predicts only for this scene. Note that this is distinctly different from most machine learning objectives, which aim to generalize beyond the training data: here, we seek to generalize across views of the same scene, not across entirely different scenes. Specifically, we optimize by rendering an image from the model and comparing it with our ground-truth images; formally, our loss is

$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left[ \big\lVert \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \big\rVert_2^2 + \big\lVert \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \big\rVert_2^2 \right]$$

where $\hat{C}_c(\mathbf{r})$ is our coarse rendering, $\hat{C}_f(\mathbf{r})$ is our fine rendering, and $C(\mathbf{r})$ is the ground truth from the image, summed over a batch of rays $\mathcal{R}$.
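A minimal sketch of that loss for a batch of rays, assuming arrays of rendered and ground-truth pixel colors of shape (R, 3); the names are illustrative only.

```python
import numpy as np

def nerf_loss(coarse_rgb, fine_rgb, gt_rgb):
    """Total squared error of both renderings against the ground-truth pixels."""
    return np.sum((coarse_rgb - gt_rgb) ** 2) + np.sum((fine_rgb - gt_rgb) ** 2)
```

The coarse rendering is supervised alongside the fine one so that its weights stay accurate enough to guide the fine sampling stage.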