A couple months ago, I replied to a tweet asking what the difference between a NeRF (neural radiance field) and 3DGS (3D Gaussian Splatting) is. It gained a lot of traction because I kept the explanation high-level for non-experts to understand. As promised, here is my blog post explaining that tweet further.
Inspired by WIRED’s 5 Levels of Difficulty playlist, this blog post takes the format of what I’m calling a “three-level summary” — three separate explanations increasing in difficulty. I feel like most technical explanations online fail to capture the high-level picture and are way too technical. I hope reading this post helps you intuitively understand the difference between NeRFs and 3DGS, and that afterwards you’re able to read and understand the corresponding papers a bit more easily. Disclaimer: this post is not meant to be formal, and details are heavily simplified for explanation purposes.
The goal of both NeRFs and 3DGS is to solve the following problem: given a few images of some 3D scene taken from different camera viewpoints, can we generate an image of this scene from any new camera viewpoint? This problem is formally called novel view synthesis, since we’re attempting to synthesize (generate) novel (new) views of some particular scene.
For example, below, we have three images from different camera viewpoints of a Lego bulldozer. How can we synthesize an image of the bulldozer from a new, imaginary camera viewpoint?
Taking it a step further, can we take a few pictures of some scene in real-life and then digitally “fly through” the scene in 3D?
Reliable, fast, and high-quality novel view synthesis makes it possible to capture immersive spatial memories, view apartments in 3D, film camera angles that aren’t physically or financially possible, and more… and is also foundational to what I’m working on, 3D telepresence.
Traditionally, people would use methods like photogrammetry to perform novel view synthesis. Photogrammetry essentially stitches many images of a scene together to create a 3D model of the scene. By rotating or moving that 3D model, you can capture new viewpoints of the scene. While this works, photogrammetry requires hundreds of high-resolution images, so it isn’t scalable. Here’s where NeRFs and 3DGS come in — using machine learning, these new techniques only require a couple dozen images to synthesize novel views. Now let’s dive into NeRFs vs. 3DGS.
A NeRF generates an image of a new viewpoint by outputting a color for each pixel. It uses machine learning to learn what colors to output.
Now, imagine a kid throwing different colors of paint onto a canvas to create an image. 3DGS is essentially this — 3DGS generates an image of a new viewpoint by drawing lots of overlapping, semi-transparent colored ‘splats’. 3DGS learns what to splat via machine learning.
3DGS renders new viewpoints almost 10x faster than the fastest implementations of NeRFs. However, 3DGS takes up 10x more memory than NeRFs.
We first need to understand rasterization vs. ray tracing, two different ways of rendering an image of a 3D scene in computer graphics. The details are out of scope, so for a more in-depth explanation see this. The key takeaways are:
NeRFs use ray tracing to create an image of a new viewpoint. Instead of doing all that complex ray-tracing math, though, a NeRF uses machine learning to learn what color each ray should produce (a sketch of how these per-pixel rays are computed follows below). The training data is the set of input images.
3DGS uses rasterization to create an image of a new viewpoint. Think of a set of splats as a point cloud, but instead of points, you have splats (3D ellipsoids). Then, you use rasterization to render a 2D image of this 3D point cloud. To view the scene from a novel viewpoint, you just rotate or move the point cloud of splats before rasterizing. Machine learning is used to create this collection of splats for a given scene. The training data is also the set of input images.
Because 3DGS uses rasterization and NeRFs use ray tracing, 3DGS rendering speeds are an order of magnitude higher than NeRFs. However, 3DGS creates millions of splats that take up 10x more memory than NeRFs, which are just neural networks.
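To make the ray-tracing takeaway concrete, here is a minimal sketch in Python/NumPy of how a ray’s position and direction could be computed for one pixel. This isn’t code from either paper; it assumes a simple pinhole camera, and the names (pixel_ray, cam_to_world, focal) are illustrative.

```python
import numpy as np

def pixel_ray(u, v, width, height, focal, cam_to_world):
    """Compute the origin and direction of the viewing ray through pixel (u, v).

    Assumes a pinhole camera with focal length `focal` (in pixels) and a 4x4
    camera-to-world transform `cam_to_world`; the camera looks down its -z axis.
    """
    # Ray direction in camera coordinates.
    d_cam = np.array([
        (u - width / 2.0) / focal,
        -(v - height / 2.0) / focal,
        -1.0,
    ])
    # Rotate into world coordinates and normalize.
    d_world = cam_to_world[:3, :3] @ d_cam
    d_world /= np.linalg.norm(d_world)
    # The ray starts at the camera center, i.e. the translation part of the transform.
    origin = cam_to_world[:3, 3]
    return origin, d_world
```

A NeRF is evaluated along rays like these; 3DGS never needs them, because the rasterizer projects the splats directly onto the image plane instead.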
NeRFs and 3DGS are different types of 3D scene representations [1]. If you have a 3D scene representation, you can perform novel view synthesis.
NeRFs build on ray tracing. A NeRF, represented as a neural network, takes as input a ray position and direction, and outputs a color [2]. Here’s some very high-level, simplified pseudocode to train a NeRF:
NeRF = randomly initialize a neural network
While average loss > threshold:
    For each input image:
        Generated image = initialize empty image of input image's size
        For each pixel in generated image:
            Get the ray position and direction for this pixel (from the camera pose)
            Color = NeRF(ray position, ray direction)
            Generated image[pixel location] = color
        Loss = difference between generated image and input image
        NeRF.train(loss)
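One thing the line `Color = NeRF(ray position, ray direction)` glosses over is that the network is actually queried at many sample points along the ray, each query returning a color and a density, and those samples are blended front to back (this is the volume-rendering part of ray tracing). Here is a hedged NumPy sketch of just that blending step; the sampling and the network itself are omitted, and the function name is mine, not the paper’s.

```python
import numpy as np

def composite_ray(sample_colors, sample_densities, sample_spacings):
    """Blend the samples along one ray into a single pixel color.

    sample_colors:    (N, 3) RGB values predicted at each sample point
    sample_densities: (N,)   volume densities predicted at each sample point
    sample_spacings:  (N,)   distances between consecutive sample points
    """
    # Each sample's opacity: denser points and longer steps absorb more light.
    alphas = 1.0 - np.exp(-sample_densities * sample_spacings)
    # Transmittance: the fraction of light that survives to reach each sample.
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    # The final color is an opacity-weighted sum of the sample colors.
    weights = transmittance * alphas
    return (weights[:, None] * sample_colors).sum(axis=0)
```

That weighted sum is the “color each ray outputs” from the level-two summary.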
What you end up with after training is a neural network, which is really just a file of weights (numbers). Because the file looks like a list of random numbers and isn’t really interpretable by a human reader, NeRFs are called implicit scene representations.
3DGS builds on rasterization. The 3DGS algorithm essentially learns a point cloud of splats [3] (i.e., ellipsoids, or mathematically, 3D Gaussians) from the input images, and once the point cloud is learned, you can place your camera wherever you want to view the point cloud from a novel view. Here’s some very high-level pseudocode to train 3DGS:
Splats = initialize 3D Gaussians (randomly, or from a sparse point cloud of the scene)
While average loss > threshold:
    For each input image:
        Generated image = Rasterize(splats, input image's camera pose)
        Loss = difference between generated image and input image
        If an area isn't covered by enough splats:
            Clone a splat into that area
        If a splat covers too large an area:
            Split it into two smaller splats
        For each splat:
            Modify the splat's parameters (position, size, color, transparency) to reduce the loss
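If you’re curious what “modify splat parameters to reduce the loss” could look like in practice, here is a toy, heavily simplified PyTorch sketch. It is not the paper’s implementation: the real method uses a fast differentiable rasterizer, anisotropic 3D Gaussians, spherical-harmonic colors, and the clone/split logic above; this toy uses isotropic 2D blobs, skips densification, and exists only to show that splat parameters are ordinary tensors optimized by gradient descent.

```python
import torch

def naive_splat(positions, colors, opacities, scales, height, width):
    """Render splats into an image with a toy, depth-free 'rasterizer'.

    positions: (N, 2) splat centers in pixel coordinates
    colors:    (N, 3) RGB per splat
    opacities: (N,)   per-splat opacity
    scales:    (N,)   isotropic standard deviation in pixels
    """
    ys = torch.arange(height, dtype=torch.float32)
    xs = torch.arange(width, dtype=torch.float32)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    pixels = torch.stack([xx, yy], dim=-1)                    # (H, W, 2)
    image = torch.zeros(height, width, 3)
    for center, color, opacity, scale in zip(positions, colors, opacities, scales):
        # Gaussian falloff of this splat's influence over the image.
        weight = opacity * torch.exp(
            -((pixels - center) ** 2).sum(dim=-1) / (2 * scale ** 2)
        )
        image = image + weight[..., None] * color
    return image

# Toy training loop: fit 20 splats to a single 32x32 target image.
target = torch.rand(32, 32, 3)                  # stand-in for an input photo
positions = (torch.rand(20, 2) * 32).requires_grad_(True)
colors = torch.rand(20, 3).requires_grad_(True)
opacities = torch.rand(20).requires_grad_(True)
scales = torch.full((20,), 3.0).requires_grad_(True)

optimizer = torch.optim.Adam([positions, colors, opacities, scales], lr=0.05)
for step in range(200):
    optimizer.zero_grad()
    rendered = naive_splat(positions, colors, opacities, scales, 32, 32)
    loss = (rendered - target).abs().mean()     # simple L1 photometric loss
    loss.backward()                             # gradients flow to every splat parameter
    optimizer.step()                            # "modify splat parameters to reduce the loss"
```

The real 3DGS rasterizer does the same kind of thing, just in 3D, with the splats sorted by depth and alpha-blended, and fast enough to train full scenes.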
No neural networks are used for 3DGS; instead, the algorithm iteratively learns to clone, split, or modify (ex. change the size or color of) splats based on the loss4, until the generated images look similar to the input images.
What you end up with after training 3DGS is a file with a list of parameters for each Gaussian, such as its position, size, color, and transparency. You feed this file into a rasterization-based renderer, along with a viewpoint, to render the splats. Because the contents of the file are directly interpretable by a human, we call 3DGS an explicit scene representation. Trained 3DGS scenes can contain millions of Gaussians.
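As a rough illustration of what “explicit” means here, each entry of such a file could be represented like this. The layout below is hypothetical and simplified; real exports typically store color as spherical-harmonic coefficients rather than a single RGB value.

```python
from dataclasses import dataclass

@dataclass
class Splat:
    """One record of a (hypothetical, simplified) 3DGS scene file."""
    position: tuple[float, float, float]          # center of the Gaussian in world space
    scale: tuple[float, float, float]             # ellipsoid size along each axis
    rotation: tuple[float, float, float, float]   # orientation as a quaternion
    color: tuple[float, float, float]             # RGB (real files often use spherical harmonics)
    opacity: float                                # 0 = fully transparent, 1 = fully opaque

# A "scene" is then just a long list of these records -- often millions of them.
scene = [
    Splat(position=(0.0, 0.0, 0.0), scale=(0.1, 0.1, 0.1),
          rotation=(1.0, 0.0, 0.0, 0.0), color=(0.8, 0.2, 0.2), opacity=0.9),
]
```

Contrast this with a NeRF checkpoint, which is just the network’s weight matrices and tells you nothing directly about where things are in the scene.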
While NeRFs and 3DGS both render 3D scenes from new viewpoints, they have different tradeoffs. NeRFs are just small neural networks that are tens of megabytes in size; 3DGS, with its millions of Gaussians, can occupy almost a gigabyte (a 10-100x difference in memory) [5]. However, because GPUs are designed for rasterization, 3DGS renders new viewpoints fast (more than 90 fps for some complex scenes). NeRFs, on the other hand, use the slower ray-tracing approach, with multiple slow neural network calls per ray. For the same complex scene, Instant-NGP, currently one of the fastest NeRF implementations, records 10 fps [6].
The good news, though, is that both methods produce very high-quality, photorealistic results [7], matching the quality of older methods like photogrammetry while needing far fewer input images. Both NeRFs and 3DGS also typically take less than an hour to train on most scenes, orders of magnitude faster than photogrammetry.
Finally, it’s basically impossible to directly convert between NeRFs and 3DGS. You could, however, convert both NeRFs and 3DGS to 3D meshes… and that’s left as an exercise for the reader.
NeRFs and 3DGS build upon different rendering paradigms, and consequently have different tradeoffs. Only time will tell which one takes off, as researchers are making improvements to each as we speak.
If you liked this explanation, follow me on Twitter! And feel free to DM me if you see any glaring mistakes.