NVIDIA’s next-gen RTX 40 series graphics cards are all set to hit retail next month, starting with the GeForce RTX 4090. Leveraging up to 16,384 FP32 cores based on the Ada Lovelace microarchitecture, it promises performance gains of up to 4x in ray-traced games with DLSS 3. Many of these gains come from improvements to the RT Cores: the RTX 4090 packs a total of 128 RT Cores, a moderate bump from the 84 featured on the RTX 3090 Ti.
Let’s look at how the RT Cores have changed from Ampere to Lovelace without getting into marketing gimmicks. Ada gets two primary upgrades in this department: the Opacity Micromap Engine and the Displaced Micro-Mesh Engine.
Opacity Micromap Engine
The Opacity Micromap Engine in the 3rd Gen RT Core optimizes BVH traversal and intersection testing (the first step in ray tracing) by simplifying the evaluation of transparent textures/meshes. It leverages an opacity micromap, a virtual mesh of micro-triangles, each carrying an opacity state used to resolve ray intersections with non-opaque triangles.
Complex translucent textures like vegetation and fire would severely tax the GPU, prolonging render times. To implement transparency in ray-traced scenes, developers tag the relevant objects as “not opaque”. When a ray intersects such an object or texture, multiple shader invocations are needed to decide whether to continue testing or return a result, and whether the intersection counts as a hit or a miss. A single ray (or small set of rays) stuck in these invocations can end up bottlenecking the entire warp, even if the rest terminate immediately.
As noted in the above figure, the opacity mask divides the leaf into numerous small triangles, which are classified by a single criterion: whether they contain no part of the leaf, part of it, or all of it. This removes the need for shader invocations on most fragments by resolving the fully opaque and fully transparent triangles directly, leaving the shaders to handle only the partially covered ones. In the best case, this can reduce shader work by two-thirds.
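The classification logic above can be sketched in a few lines. This is a toy model, not NVIDIA’s implementation: the function names and the idea of sampling alpha values inside each micro-triangle are illustrative assumptions.

```python
# Illustrative sketch: classify each micro-triangle of an opacity micromap
# from alpha samples taken inside it, so that only "unknown" (partially
# covered) triangles need to invoke the any-hit shader.

OPAQUE, TRANSPARENT, UNKNOWN = "opaque", "transparent", "unknown"

def classify_micro_triangle(alpha_samples, threshold=0.5):
    """Classify one micro-triangle from alpha samples inside it (toy model)."""
    hits = [a >= threshold for a in alpha_samples]
    if all(hits):
        return OPAQUE       # resolved without a shader: the ray hits
    if not any(hits):
        return TRANSPARENT  # resolved without a shader: the ray passes through
    return UNKNOWN          # mixed coverage: fall back to the any-hit shader

def shader_invocations_needed(micromap):
    """Count how many micro-triangles still need a shader call."""
    return sum(1 for tri in micromap if classify_micro_triangle(tri) == UNKNOWN)

# A toy 4-triangle micromap for a leaf texture: two triangles are fully
# outside the leaf, one fully inside, and one straddles the edge.
micromap = [
    [1.0, 1.0, 1.0],  # inside the leaf  -> opaque
    [0.0, 0.0, 0.0],  # outside the leaf -> transparent
    [0.0, 0.0, 0.0],  # outside the leaf -> transparent
    [1.0, 0.2, 0.9],  # leaf edge        -> unknown, needs the shader
]
print(shader_invocations_needed(micromap))  # 1 of 4 triangles needs a shader
```

Here, three of the four micro-triangles resolve in fixed-function hardware and only the edge triangle falls back to the shader, mirroring the “reduce shader work by two-thirds” best case described above.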
Displaced Micro-Mesh Engine
Like opacity micromaps, displaced micro-meshes are another way to optimize BVH structures and accelerate traversal for faster ray tracing. Where the former is instrumental with highly detailed and/or transparent textures, the latter boosts performance with complex geometry and high-poly meshes without losing detail.
The 3rd Gen RT Core on Ada needs a much simpler input to output complex tessellated geometry, courtesy of displaced micro-meshes. The displaced micro-mesh is a new geometric primitive (like the triangle) that leverages spatial coherence to represent high-fidelity geometry compactly. Each is defined by a base triangle and a displacement map. The micro-mesh engine within the RT Core uses this definition to generate highly detailed, tessellated meshes from a rather plain base mesh, as shown below.
By feeding the RT Core a coarser, simpler mesh, the complexity of the traversal and intersection tests is significantly reduced, while the rendered result retains full geometric detail.
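The base-triangle-plus-displacement-map idea can be illustrated with a minimal sketch. Everything here (the 1:4 subdivision, the `displace` helper, the toy displacement function) is a simplifying assumption, not the actual micro-mesh format or API:

```python
# Hedged sketch of the displaced micro-mesh concept: a base triangle plus
# a displacement map expands into a tessellated micro-mesh.

def midpoint(a, b):
    return tuple((a[i] + b[i]) / 2 for i in range(3))

def subdivide(tri):
    """One level of 1:4 subdivision of a triangle (v0, v1, v2)."""
    v0, v1, v2 = tri
    m01, m12, m20 = midpoint(v0, v1), midpoint(v1, v2), midpoint(v2, v0)
    return [(v0, m01, m20), (m01, v1, m12), (m20, m12, v2), (m01, m12, m20)]

def displace(tri_list, normal, disp_map):
    """Offset every vertex along the base normal by a looked-up scalar."""
    def offset(v):
        d = disp_map(v)  # displacement amount sampled at this vertex
        return tuple(v[i] + d * normal[i] for i in range(3))
    return [tuple(offset(v) for v in tri) for tri in tri_list]

# One flat base triangle plus a toy displacement function...
base = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
normal = (0.0, 0.0, 1.0)
bumpy = lambda v: 0.1 * (v[0] + v[1])  # stand-in for a displacement map

# ...expands into four displaced micro-triangles.
micro_mesh = displace(subdivide(base), normal, bumpy)
print(len(micro_mesh))  # 4
```

Each additional subdivision level multiplies the micro-triangle count by four, which is how a small base mesh and a compact displacement map stand in for millions of triangles in the BVH.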
Shader Execution Reordering (SER)
Divergent RT shaders are one of the primary factors limiting performance in ray-traced workloads, a problem familiar from specular lighting, path tracing, and other multi-bounce effects. Divergence usually arises when the threads of a warp start working on different shaders or memory resources, undermining the parallel nature of these otherwise vector workloads. In ray tracing, this usually means rays acting on different meshes, materials, or BVH structures.
In the above figure, the cat statue, the floor, the wall, and the ceiling are distinct structures with their own shaders and/or resources. Under existing designs, the threads of a warp will likely cast rays intersecting different BVH structures in the scene, leading to divergence and unnecessary overhead. With Shader Execution Reordering (SER), these threads are reordered so that the rays from a given warp are directed at the same BVH structures/objects, improving utilization and execution time.
SER is most helpful with secondary rays and path tracing, commonly used for specular and indirect lighting. As these rays shoot off in random directions across the scene, they tend to be much more divergent. Shader Execution Reordering adds a new stage to the ray tracing pipeline where these diverging threads are reordered and grouped according to the shaders and resources they request.
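The grouping idea behind SER can be modeled in a few lines. This is a toy CPU-side analogy, not the hardware mechanism: the ray representation and the choice of the hit shader as the sort key are assumptions for illustration.

```python
# Toy model of shader-execution reordering: sort rays by the shader they
# will invoke, so each warp-sized batch runs a single shader coherently.

WARP_SIZE = 32

def reorder_rays(rays):
    """Group rays so those needing the same shader become adjacent."""
    return sorted(rays, key=lambda ray: ray["shader"])

def warps(rays):
    """Split a ray list into warp-sized batches."""
    return [rays[i:i + WARP_SIZE] for i in range(0, len(rays), WARP_SIZE)]

def divergent_warps(rays):
    """Count batches whose rays need more than one shader."""
    return sum(1 for w in warps(rays) if len({r["shader"] for r in w}) > 1)

# 128 secondary rays bouncing off four different surfaces in round-robin
# order -- the worst case for coherence.
rays = [{"id": i, "shader": ("floor", "wall", "ceiling", "statue")[i % 4]}
        for i in range(128)]

print(divergent_warps(rays))                # before: every warp mixes shaders
print(divergent_warps(reorder_rays(rays)))  # after: every warp is uniform
```

Before reordering, all four warps mix four shaders; after reordering, each warp executes exactly one, which is the utilization win SER is after.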
NVIDIA claims a performance boost of up to 2x in divergent RT workloads such as path tracing. In Cyberpunk 2077 (Uber RT mode), an overall gain of 25% was observed with SER. Unfortunately, it needs to be implemented on a per-application basis, so existing games won’t benefit from it unless developers add support.
Optical Flow Accelerator and DLSS 3
The Optical Flow Accelerator was introduced with the Ampere architecture but didn’t get much exposure last gen. Optical flow estimation measures the direction and magnitude of pixel motion between consecutively rendered frames (temporally). With Ada Lovelace, the throughput of the Optical Flow Accelerator (OFA) has been doubled to 300 TeraOPS (TOPS), feeding the DLSS 3 network to boost frame rates by up to 3–4x. I’ll touch upon DLSS 3 in a separate post.
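Conceptually, optical flow estimation resembles block matching: find where a patch of pixels moved between two frames. A deliberately tiny 1-D sketch follows; it bears no resemblance to the OFA’s actual hardware algorithm, and every name in it is hypothetical:

```python
# Toy 1-D block matching: for a patch of pixels from frame N, find the
# shift in frame N+1 that minimizes the sum of absolute differences (SAD).
# The resulting offset is the patch's motion vector.

def sad(a, b):
    """Sum of absolute differences between two equal-length pixel runs."""
    return sum(abs(x - y) for x, y in zip(a, b))

def motion_vector(frame0, frame1, pos, size=4, search=3):
    """Best displacement of the block frame0[pos:pos+size] into frame1."""
    block = frame0[pos:pos + size]
    candidates = range(max(-search, -pos),
                       min(search, len(frame1) - size - pos) + 1)
    return min(candidates,
               key=lambda dx: sad(block, frame1[pos + dx:pos + dx + size]))

frame0 = [0, 0, 9, 9, 9, 9, 0, 0, 0, 0]
frame1 = [0, 0, 0, 0, 9, 9, 9, 9, 0, 0]  # the bright patch moved right by 2
print(motion_vector(frame0, frame1, pos=2))  # -> 2
```

A real optical flow engine does this densely in 2-D with hierarchical search and much more, but the output is the same in spirit: per-block motion vectors that DLSS 3 can use to place pixels in a generated frame.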
Ada 4th Gen Tensor Core
The Tensor Core counts and design are essentially unchanged. The primary gains come in mixed-precision compute: the 4th Gen Tensor Cores double the FP16, BF16, TF32, INT8, and INT4 tensor throughput. They also include the FP8 Transformer Engine from Hopper, delivering over 1.3 PetaFLOPS of tensor processing on the RTX 4090.
AD102: Powering the RTX 4090 with 16,384 Cores and 72MB of L2 Cache
The Ada SM is largely identical to Ampere’s, with 64 cores leveraging a dedicated FP32 datapath and another 64 sharing an FP32/INT32 datapath, for a total of 128 cores. The L1 data cache has been enlarged from 96KB on Ampere to 128KB on Ada. The L2 cache gets a far bigger uplift, going from just 6MB to 72MB on the RTX 4090.
Each of the four partitions in an SM packs a 64KB register file, an L0 instruction cache, one warp scheduler and dispatch unit, 16 cores with a dedicated FP32 datapath, and 16 cores sharing an FP32 and an INT32 datapath. This results in a peak theoretical throughput of 32 FP32 operations, or 16 FP32 + 16 INT32 operations, per partition per clock.
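The SM math above can be sanity-checked with some back-of-the-envelope arithmetic. The 128 SM count and the ~2.52 GHz boost clock are the RTX 4090’s published figures; everything else follows from the per-partition throughput just described:

```python
# Back-of-the-envelope check of the SM arithmetic in the text.

PARTITIONS_PER_SM = 4
FP32_LANES_PER_PARTITION = 32  # 16 dedicated + 16 shared FP32 lanes
SM_COUNT = 128                 # RTX 4090's published SM count
BOOST_CLOCK_HZ = 2.52e9        # RTX 4090's published boost clock

cores = SM_COUNT * PARTITIONS_PER_SM * FP32_LANES_PER_PARTITION
print(cores)  # 16384 FP32 cores, matching the headline figure

# An FMA counts as two floating-point operations per clock.
tflops = cores * 2 * BOOST_CLOCK_HZ / 1e12
print(round(tflops, 1))  # ~82.6 TFLOPS peak FP32
```

The result lines up with the card’s advertised ~83 TFLOPS of single-precision compute, so the per-partition breakdown is internally consistent.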
With the 40 series lineup, NVIDIA has paired most high-end and midrange SKUs with even faster GDDR6X memory. The RTX 4080 features 22.4Gbps GDDR6X memory, while the RTX 4090 pairs 21Gbps GDDR6X with a 384-bit bus for a peak bandwidth of just over 1 TB/s.
And finally, the process node that made all this possible: TSMC’s N4, an optimization of the N5 process. Courtesy of this cutting-edge EUV node, the AD102 squeezes in 70% more cores than the GA102 while boosting close to 3 GHz and staying under 500W. From what we’ve seen, thermals are also quite reasonable for such a large, dense die.