A couple months ago, I replied to a tweet asking what the difference between a NeRF (neural radiance field) and 3DGS (3D Gaussian Splatting) is. It gained a lot of traction because I kept the explanation high-level for non-experts to understand. As promised, here is my blog post explaining that tweet further.
Inspired by WIRED’s 5 Levels of Difficulty playlist, this blog post takes the format of what I’m calling a “three-level summary” — three separate explanations increasing in difficulty. I feel like most technical explanations online fail to capture the high-level picture and are way too technical. I hope reading this post helps you intuitively understand the difference between NeRFs and 3DGS, and that afterwards you’re able to read and understand the corresponding papers a bit more easily. Disclaimer: this post is not meant to be formal, and details are heavily simplified for explanation purposes.
The goal of both NeRFs and 3DGS is to solve the following problem: given a few images of some 3D scene taken from different camera viewpoints, can we generate an image of this scene from any new camera viewpoint? This problem is formally called novel view synthesis, since we’re attempting to synthesize (generate) novel (new) views of some particular scene.
For example, below, we have three images from different camera viewpoints of a Lego bulldozer. How can we synthesize an image of the bulldozer from a new, imaginary camera viewpoint?
Taking it a step further, can we take a few pictures of some scene in real-life and then digitally “fly through” the scene in 3D?
Reliable, fast, and high-quality novel view synthesis makes it possible to capture immersive spatial memories, view apartments in 3D, film camera angles that aren’t physically or financially possible, and more… and is also foundational to what I’m working on, 3D telepresence.
Traditionally, people would use methods like photogrammetry to perform novel view synthesis. Photogrammetry essentially stitches many images of a scene together to create a 3D model of the scene. By rotating or moving that 3D model, you can capture new viewpoints of the scene. While this works, photogrammetry requires hundreds of high-resolution images, so it isn’t scalable. Here’s where NeRFs and 3DGS come in — using machine learning, these new techniques only require a couple dozen images to synthesize novel views. Now let’s dive into NeRFs vs. 3DGS.
A NeRF generates an image of a new viewpoint by outputting a color for each pixel. It uses machine learning to learn what colors to output.
Now, imagine a kid throwing different colors of paint onto a canvas to create an image. 3DGS is essentially this — 3DGS generates an image of a new viewpoint by drawing lots of overlapping, semi-transparent colored ‘splats’. 3DGS learns what to splat via machine learning.
3DGS renders new viewpoints almost 10x faster than the fastest implementations of NeRFs. However, 3DGS takes up 10x more memory than NeRFs.
We first need to understand rasterization vs. ray tracing, two different ways of rendering an image of a 3D scene in computer graphics. The details are out of scope, so for a more in-depth explanation see this. The key takeaways are:
NeRFs use ray tracing to create an image of a new viewpoint. Instead of doing all that complex ray-tracing math, though, a NeRF uses machine learning to learn what color each ray should produce (a sketch of how these per-pixel rays are computed follows below). The training data is the set of input images.
3DGS uses rasterization to create an image of a new viewpoint. Think of a set of splats as a point cloud, but instead of points, you have splats (3D ellipsoids). Then, you use rasterization to render a 2D image of this 3D point cloud. To view the scene from a novel viewpoint, you just rotate or move the point cloud of splats before rasterizing. Machine learning is used to create this collection of splats for a given scene. The training data is also the set of input images.
Because 3DGS uses rasterization and NeRFs use ray tracing, 3DGS rendering speeds are an order of magnitude higher than NeRFs. However, 3DGS creates millions of splats that take up 10x more memory than NeRFs, which are just neural networks.
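To make the ray-tracing takeaway concrete, here is a minimal sketch in Python/NumPy of how a ray’s position and direction could be computed for one pixel. This isn’t code from either paper; it assumes a simple pinhole camera, and the names (pixel_ray, cam_to_world, focal) are illustrative.

```python
import numpy as np

def pixel_ray(u, v, width, height, focal, cam_to_world):
    """Compute the origin and direction of the viewing ray through pixel (u, v).

    Assumes a pinhole camera with focal length `focal` (in pixels) and a 4x4
    camera-to-world transform `cam_to_world`; the camera looks down its -z axis.
    """
    # Ray direction in camera coordinates.
    d_cam = np.array([
        (u - width / 2.0) / focal,
        -(v - height / 2.0) / focal,
        -1.0,
    ])
    # Rotate into world coordinates and normalize.
    d_world = cam_to_world[:3, :3] @ d_cam
    d_world /= np.linalg.norm(d_world)
    # The ray starts at the camera center, i.e. the translation part of the transform.
    origin = cam_to_world[:3, 3]
    return origin, d_world
```

A NeRF is evaluated along rays like these; 3DGS never needs them, because the rasterizer projects the splats directly onto the image plane instead.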
NeRFs and 3DGS are different types of 3D scene representations [1]. If you have a 3D scene representation, you can perform novel view synthesis.
NeRFs build on ray tracing. A NeRF, represented as a neural network, takes as input a ray position and direction, and outputs a color [2]. Here’s some very high-level, simplified pseudocode to train a NeRF:
NeRF = randomly initialize a neural network
While average loss > threshold:
    For each input image:
        Generated image = initialize empty image of input image's size
        For each pixel in generated image:
            Get the ray position and direction for this pixel (from the camera pose)
            Color = NeRF(ray position, ray direction)
            Generated image[pixel location] = color
        Loss = difference between generated image and input image
        NeRF.train(loss)
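One thing the line `Color = NeRF(ray position, ray direction)` glosses over is that the network is actually queried at many sample points along the ray, each query returning a color and a density, and those samples are blended front to back (this is the volume-rendering part of ray tracing). Here is a hedged NumPy sketch of just that blending step; the sampling and the network itself are omitted, and the function name is mine, not the paper’s.

```python
import numpy as np

def composite_ray(sample_colors, sample_densities, sample_spacings):
    """Blend the samples along one ray into a single pixel color.

    sample_colors:    (N, 3) RGB values predicted at each sample point
    sample_densities: (N,)   volume densities predicted at each sample point
    sample_spacings:  (N,)   distances between consecutive sample points
    """
    # Each sample's opacity: denser points and longer steps absorb more light.
    alphas = 1.0 - np.exp(-sample_densities * sample_spacings)
    # Transmittance: the fraction of light that survives to reach each sample.
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    # The final color is an opacity-weighted sum of the sample colors.
    weights = transmittance * alphas
    return (weights[:, None] * sample_colors).sum(axis=0)
```

That weighted sum is the “color each ray outputs” from the level-two summary.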
What you end up with after training is a neural network, which is really just a file of weights (numbers). Because the file looks like a list of random numbers and isn’t really interpretable by a human reader, NeRFs are called implicit scene representations.
3DGS builds on rasterization. The 3DGS algorithm essentially learns a point cloud of splats [3] (i.e., ellipsoids, or mathematically, 3D Gaussians) from the input images, and once the point cloud is learned, you can place your camera wherever you want to view the point cloud from a novel view. Here’s some very high-level pseudocode to train 3DGS:
Splats = initialize 3D Gaussians (randomly, or from a sparse point cloud of the scene)
While average loss > threshold:
    For each input image:
        Generated image = Rasterize(splats, input image's camera pose)
        Loss = difference between generated image and input image
        If an area isn't covered by enough splats:
            Clone a splat into that area
        If a splat covers too large an area:
            Split it into two smaller splats
        For each splat:
            Modify the splat's parameters (position, size, color, transparency) to reduce the loss
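If you’re curious what “modify splat parameters to reduce the loss” could look like in practice, here is a toy, heavily simplified PyTorch sketch. It is not the paper’s implementation: the real method uses a fast differentiable rasterizer, anisotropic 3D Gaussians, spherical-harmonic colors, and the clone/split logic above; this toy uses isotropic 2D blobs, skips densification, and exists only to show that splat parameters are ordinary tensors optimized by gradient descent.

```python
import torch

def naive_splat(positions, colors, opacities, scales, height, width):
    """Render splats into an image with a toy, depth-free 'rasterizer'.

    positions: (N, 2) splat centers in pixel coordinates
    colors:    (N, 3) RGB per splat
    opacities: (N,)   per-splat opacity
    scales:    (N,)   isotropic standard deviation in pixels
    """
    ys = torch.arange(height, dtype=torch.float32)
    xs = torch.arange(width, dtype=torch.float32)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    pixels = torch.stack([xx, yy], dim=-1)                    # (H, W, 2)
    image = torch.zeros(height, width, 3)
    for center, color, opacity, scale in zip(positions, colors, opacities, scales):
        # Gaussian falloff of this splat's influence over the image.
        weight = opacity * torch.exp(
            -((pixels - center) ** 2).sum(dim=-1) / (2 * scale ** 2)
        )
        image = image + weight[..., None] * color
    return image

# Toy training loop: fit 20 splats to a single 32x32 target image.
target = torch.rand(32, 32, 3)                  # stand-in for an input photo
positions = (torch.rand(20, 2) * 32).requires_grad_(True)
colors = torch.rand(20, 3).requires_grad_(True)
opacities = torch.rand(20).requires_grad_(True)
scales = torch.full((20,), 3.0).requires_grad_(True)

optimizer = torch.optim.Adam([positions, colors, opacities, scales], lr=0.05)
for step in range(200):
    optimizer.zero_grad()
    rendered = naive_splat(positions, colors, opacities, scales, 32, 32)
    loss = (rendered - target).abs().mean()     # simple L1 photometric loss
    loss.backward()                             # gradients flow to every splat parameter
    optimizer.step()                            # "modify splat parameters to reduce the loss"
```

The real 3DGS rasterizer does the same kind of thing, just in 3D, with the splats sorted by depth and alpha-blended, and fast enough to train full scenes.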
No neural networks are used for 3DGS; instead, the algorithm iteratively learns to clone, split, or modify (ex. change the size or color of) splats based on the loss4, until the generated images look similar to the input images.
What you end up with after training 3DGS is a file with a list of parameters for each Gaussian, such as its position, size, color, and transparency. You feed this file into a rasterization-based renderer, along with a viewpoint, to render the splats. Because the contents of the file are directly interpretable by a human, we call 3DGS an explicit scene representation. Trained 3DGS scenes can contain millions of Gaussians.
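As a rough illustration of what “explicit” means here, each entry of such a file could be represented like this. The layout below is hypothetical and simplified; real exports typically store color as spherical-harmonic coefficients rather than a single RGB value.

```python
from dataclasses import dataclass

@dataclass
class Splat:
    """One record of a (hypothetical, simplified) 3DGS scene file."""
    position: tuple[float, float, float]          # center of the Gaussian in world space
    scale: tuple[float, float, float]             # ellipsoid size along each axis
    rotation: tuple[float, float, float, float]   # orientation as a quaternion
    color: tuple[float, float, float]             # RGB (real files often use spherical harmonics)
    opacity: float                                # 0 = fully transparent, 1 = fully opaque

# A "scene" is then just a long list of these records -- often millions of them.
scene = [
    Splat(position=(0.0, 0.0, 0.0), scale=(0.1, 0.1, 0.1),
          rotation=(1.0, 0.0, 0.0, 0.0), color=(0.8, 0.2, 0.2), opacity=0.9),
]
```

Contrast this with a NeRF checkpoint, which is just the network’s weight matrices and tells you nothing directly about where things are in the scene.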
While NeRFs and 3DGS both render 3D scenes from new viewpoints, they have different tradeoffs. NeRFs are just small neural networks that are tens of megabytes in size; 3DGS, with its millions of Gaussians, can occupy almost a gigabyte (a 10-100x difference in memory) [5]. However, because GPUs are designed for rasterization, 3DGS renders new viewpoints fast (more than 90 fps for some complex scenes). NeRFs, on the other hand, use the slower ray-tracing approach, with multiple slow neural network calls per ray. For the same complex scene, Instant-NGP, currently one of the fastest NeRF implementations, records 10 fps [6].
The good news, though, is that both methods produce very high-quality, photorealistic results [7], matching the quality of older methods like photogrammetry while needing far fewer input images. Both NeRFs and 3DGS also typically take less than an hour to train on most scenes, orders of magnitude faster than photogrammetry.
Finally, it’s basically impossible to directly convert between NeRFs and 3DGS. You could, however, convert both NeRFs and 3DGS to 3D meshes… and that’s left as an exercise for the reader.
NeRFs and 3DGS build upon different rendering paradigms, and consequently have different tradeoffs. Only time will tell which one takes off, as researchers are making improvements to each as we speak.
If you liked this explanation, follow me on Twitter! And feel free to DM me if you see any glaring mistakes.