XF-Blog
MACHINE LEARNING PAPER NOTE
[Paper Note] NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis


The core method of this approach is to represent a 3D scene using a Multilayer Perceptron (MLP). The network takes a 3D position and a viewing direction as input and outputs the corresponding color and volume density.
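This input/output contract can be sketched as a toy network (a minimal numpy stand-in for the paper's 8-layer, 256-channel MLP; `init_mlp` and `query_field` are illustrative names, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a toy fully-connected network (stand-in for NeRF's MLP)."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def query_field(params, x, d):
    """Map a 3D point x and a unit viewing direction d to (rgb, sigma)."""
    h = np.concatenate([x, d])              # 3D position + 3D direction vector
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)      # ReLU hidden layers
    W, b = params[-1]
    out = h @ W + b
    rgb = 1.0 / (1.0 + np.exp(-out[:3]))    # sigmoid keeps color in [0, 1]
    sigma = np.maximum(out[3], 0.0)         # volume density must be non-negative
    return rgb, sigma
```

The real model additionally feeds the position through positional encoding and injects the viewing direction only near the output, so density is independent of the view.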

Given the focal length and the camera’s image plane, a ray is cast from the camera through each pixel. Along this ray, a series of points are sampled. These points, along with the direction of the ray, are fed into the MLP. The outputs are then integrated using a volume rendering formula to determine the final color of the pixel. The model is trained using a Mean Squared Error (MSE) loss between the rendered pixel and the ground truth.
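The per-pixel ray casting can be sketched as follows (a minimal numpy version under an assumed pinhole-camera convention — x right, y down, camera looking along −z — with an illustrative `get_rays` name; not the paper's code):

```python
import numpy as np

def get_rays(H, W, focal, c2w):
    """Cast one ray per pixel through a pinhole camera.

    c2w is a 3x4 camera-to-world matrix: a 3x3 rotation and a translation column.
    Returns per-pixel ray origins and directions in world space.
    """
    i, j = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32), indexing="xy")
    # Per-pixel directions in camera space, using the focal length to scale.
    dirs = np.stack([(i - W * 0.5) / focal,
                     -(j - H * 0.5) / focal,
                     -np.ones_like(i)], axis=-1)
    rays_d = dirs @ c2w[:3, :3].T                        # rotate into world space
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)   # camera origin, per ray
    return rays_o, rays_d
```

Points along each ray are then `rays_o + t * rays_d` for sampled depths `t`, and the MSE loss compares the rendered pixel colors against the ground-truth image.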

(Figure 2 of the paper: overview of the NeRF pipeline — sampling points along camera rays, querying the MLP, and volume rendering to produce pixel colors.)

Related Work

Neural networks can be used to record textures or indirect illumination within 3D scenes.

Neural 3D Shape Representations

Mesh and Voxel Approaches

Method

The MLP takes a 3D position $(x, y, z)$ and a 2D viewing direction $(\theta, \phi)$ as input.

Input Generation

MLP Output

Volume Rendering

The color of a ray $C(r)$ is calculated as:

$$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)$$

In simple terms, the integrand at each point along the ray is the product of:

- the accumulated transmittance $T(t)$, the probability that the ray travels from $t_n$ to $t$ without being blocked;
- the volume density $\sigma(r(t))$ at the point;
- the color $c(r(t), d)$ emitted at the point toward the viewing direction $d$.

To approximate this integral numerically, $N$ points are sampled by dividing the space between the near and far bounds into $N$ evenly spaced bins and drawing one uniform random sample within each bin (stratified sampling). This randomness means the MLP is evaluated at continuous positions over the course of training, which helps it learn a continuous function rather than memorizing a fixed grid.
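The stratified sampling step can be sketched in a few lines (the function name `stratified_samples` is illustrative):

```python
import numpy as np

def stratified_samples(t_near, t_far, N, rng=None):
    """Split [t_near, t_far] into N equal bins and draw one uniform sample per bin."""
    if rng is None:
        rng = np.random.default_rng(0)
    edges = np.linspace(t_near, t_far, N + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * rng.random(N)   # one random depth per bin
```

Because each sample stays inside its own bin, the depths come out sorted, while still covering continuous positions across training iterations.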

Discretization

The integral is approximated using a discrete sum:

$$\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - \exp(-\sigma_i \delta_i)\right) c_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$$

where $\delta_i = t_{i+1} - t_i$ is the distance between adjacent samples.
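This quadrature is straightforward to implement; a minimal numpy sketch for a single ray (the function name `render_ray` and the handling of the last interval via `t_far` are assumptions of this sketch):

```python
import numpy as np

def render_ray(sigmas, colors, ts, t_far):
    """Discrete volume rendering along one ray.

    sigmas: (N,) densities; colors: (N, 3) RGB; ts: (N,) sorted sample depths.
    """
    deltas = np.append(np.diff(ts), t_far - ts[-1])    # distances between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)            # per-sample opacity
    # Transmittance T_i: probability of reaching sample i unblocked.
    T = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))
    weights = T * alphas                               # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0), weights
```

Because every operation here is differentiable, the MSE loss on rendered pixels can backpropagate into the MLP's density and color outputs.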

Implementation Details

Positional Encoding: To capture high-frequency details, each scalar input is mapped to a higher-dimensional vector of sine and cosine functions at increasing frequencies, $\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right)$. The paper uses $L = 10$ for position (a 20-dimensional vector per scalar) and $L = 4$ for viewing direction.
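The encoding itself is a few lines of numpy (sines first, then cosines — an ordering choice of this sketch, not the paper's):

```python
import numpy as np

def positional_encoding(p, L):
    """Map each scalar coordinate to a 2L-dim vector of sines and cosines."""
    freqs = (2.0 ** np.arange(L)) * np.pi            # pi * 2^0, ..., pi * 2^{L-1}
    angles = np.asarray(p)[..., None] * freqs        # (..., L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., 2L)
```

With $L = 10$, each of the three position coordinates expands to 20 values, giving a 60-dimensional encoded position.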

Hierarchical Volume Sampling: Two MLPs (coarse and fine) are used. The coarse network is evaluated first on the stratified samples. Its resulting weights $w_i = T_i \left(1 - \exp(-\sigma_i \delta_i)\right)$, normalized, define a piecewise-constant probability density function (PDF) along the ray. Additional points are then sampled from this PDF, and all samples (coarse and fine together) are rendered through the fine network, concentrating computation where the scene content actually is. The paper uses $N_c = 64$ coarse and $N_f = 128$ fine samples.
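The PDF resampling step amounts to inverse-CDF sampling. A simplified sketch (this version snaps samples to bin centers, whereas the paper interpolates linearly within each bin; `sample_pdf` is an illustrative name):

```python
import numpy as np

def sample_pdf(bin_mids, weights, N_fine, rng=None):
    """Draw extra depth samples in proportion to the coarse weights."""
    if rng is None:
        rng = np.random.default_rng(0)
    pdf = weights / weights.sum()                    # normalize weights to a PDF
    cdf = np.cumsum(pdf)
    u = rng.random(N_fine)                           # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side="right")      # invert the CDF per draw
    idx = np.minimum(idx, len(bin_mids) - 1)         # guard against rounding
    return bin_mids[idx]
```

Bins where the coarse network saw high density receive proportionally more fine samples, so most MLP evaluations land near visible surfaces.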

Training

Results

Baseline Methods

Evaluation Metrics

Ablation Study
