NVIDIA’s tagline over the past two years has been “RTX ON/RTX OFF”. Memes and humor aside, a lot of work went into designing the Turing GPU architecture and turning RTX ON. Sure, the technology is still in its infancy, but it’s moving in the right direction. In terms of raw performance, NVIDIA’s RTX 20-series graphics cards offer a relatively modest boost over the preceding GTX 10-series Pascal cards. Instead, the bulk of NVIDIA’s R&D (and marketing) budget for Turing went into implementing real-time ray tracing, branded RTX, along with AI-based upscaling, known as DLSS.
Although the Turing architecture includes notable improvements to the graphics core, the inclusion of the new RT Cores along with the Tensor cores is what really stands out.
For the first time ever, NVIDIA has enabled real-time ray tracing in PC gaming…at least partially. Why is this a big deal, and how was the architecture designed to support it? Let’s dig in!
The TU102: Exploring the Turing Flagship
When a new generation of graphics cards is planned, there are usually just two or three core GPU dies in the pipeline, known as Gx102, Gx104, and so on. The fully enabled die is used in the flagship products, and its cut-down variants form the rest of the consumer lineup. For the Pascal family (GTX 10 series), this was the GP102, powering the GTX 1080 Ti and the Titan X/Xp. The core GPU die of the Turing lineup is the TU102. Here are its specs:
- CUDA Cores: 4,608
- RT Cores: 72
- Tensor Cores: 576
- Texture Units: 288
- Memory Config: 384-bit (32-bit x 12)
- ROPs: 96
- L2 Cache: 6144 KB (512KB x 12)
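Each of those figures is a multiple of a per-SM or per-memory-partition building block. A quick sanity check in Python (the per-SM counts are the publicly documented Turing ratios):

```python
# Sanity-check the TU102 spec sheet groupings (figures from the table above).
MEM_CONTROLLERS = 12                    # twelve 32-bit GDDR6 controllers
BUS_WIDTH = 32 * MEM_CONTROLLERS        # -> 384-bit aggregate bus
L2_PER_PARTITION_KB = 512
L2_TOTAL_KB = L2_PER_PARTITION_KB * MEM_CONTROLLERS  # -> 6144 KB (6 MB)

SM_COUNT = 72                           # full TU102 die
CUDA_PER_SM = 64                        # a Turing SM has 64 FP32 cores
TENSOR_PER_SM = 8
RT_PER_SM = 1

cuda_cores = SM_COUNT * CUDA_PER_SM     # 4,608
tensor_cores = SM_COUNT * TENSOR_PER_SM # 576
rt_cores = SM_COUNT * RT_PER_SM         # 72
```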
Like previous generations, it’s the Titan that gets the full-scale x102 die, while the x80 Ti is a cut-down variant that sells for a relatively lower price, serving enthusiast gamers. In this case, that would be the GeForce RTX 2080 Ti. Despite its $1,000 price tag, it doesn’t feature the entire TU102 die. That honor goes to the Titan RTX.
Like the 1080 Ti before it, the 2080 Ti loses four SMs and the cores that come with them, Tensor, RT, and standard shaders included. One memory controller is also disabled, along with eight render output units. You can compare the cut-down TU102 (used in the Ti) to the unmodified die employed in the Quadro RTX 6000 in the spec tables above and below.
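Apply that cut to the full die's figures and the 2080 Ti's published specs fall out directly. A small sketch, using the per-SM and per-controller ratios of the full TU102:

```python
# Deriving the RTX 2080 Ti's unit counts from the full TU102 die:
# four SMs and one 32-bit memory controller (with its ROPs and L2 slice)
# are disabled. Per-SM ratios: 64 CUDA, 8 Tensor, 1 RT core.
CUT_SMS = 72 - 4                   # 68 SMs remain on the 2080 Ti
cuda = CUT_SMS * 64                # 4,352 CUDA cores
tensor = CUT_SMS * 8               # 544 Tensor cores
rt = CUT_SMS * 1                   # 68 RT cores

controllers = 12 - 1               # one memory controller disabled
bus_width = controllers * 32       # 352-bit bus
rops = 96 - 8                      # 88 ROPs
l2_kb = controllers * 512          # 5,632 KB of L2 cache
```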
The Turing SM (Streaming Multiprocessor)
With every new GPU microarchitecture, NVIDIA redesigns (or rather rearranges) its SM or Streaming Multiprocessor. The core count per block is changed, the arrangement of the Load/Store Units and the Special Function Units is modified and at times the cache configuration is overhauled. This time all three have undergone a makeover:
Unlike the Big Pascal SM, the Turing SM still contains 128 cores across four partitions on both Turing dies, but the layout has been overhauled: they are now split into 64 dedicated FP32 and 64 dedicated INT32 cores. With Pascal, you could only run FP32 or INT32 instructions in a given cycle, not both. This reduced instruction-level parallelism and caused delays in programs that mixed the two kinds of instructions. Turing’s separate FP32 and INT32 cores allow simultaneous execution of integer and floating-point workloads. This is NVIDIA’s version of Asynchronous Compute: the implementation and function are different, but the purpose is the same, to maximize GPU utilization.
NVIDIA claims that this concurrent execution has improved the performance in-game by up to 36%.
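That 36% figure lines up with the instruction mix NVIDIA quotes for typical games, roughly 36 integer instructions per 100 floating-point ones. A toy issue model (the cycle counts are illustrative, not a real pipeline simulation) shows where the gain comes from:

```python
# Toy issue model: on Pascal, FP32 and INT32 share one pipe, so a mixed
# stream serializes; on Turing, the two independent pipes overlap.
# The 100:36 FP-to-INT mix is the ratio NVIDIA quotes for typical games.
fp_ops, int_ops = 100, 36

pascal_cycles = fp_ops + int_ops        # one shared pipe: 136 cycles
turing_cycles = max(fp_ops, int_ops)    # two pipes overlapping: 100 cycles

speedup = pascal_cycles / turing_cycles # 1.36x, i.e. "up to 36%"
```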
However, on the downside, each core cluster gets one dispatch unit instead of two, which means Turing can’t issue a second instruction from the same thread each cycle. This drawback is minimized by the way Turing handles instruction scheduling: each SM partition takes two clock cycles to execute one warp-wide instruction, but in the meantime its scheduler can issue an independent instruction every cycle. Turing therefore effectively retains Pascal’s issue capability while doubling the number of sub-SM partitions.
Another major change is with respect to thread scheduling. As you may know, NVIDIA uses an execution model dubbed SIMT (Single Instruction, Multiple Threads). Although not exactly SIMD in nature, scheduling works on a per-warp basis, with each warp consisting of 32 threads. Each SM partition’s scheduler manages its own set of resident warps, and a Turing SM can keep up to 32 warps (1,024 threads) in flight.
However, like Volta, Turing also has independent thread scheduling. While Maxwell and Pascal scheduled threads per warp, Turing schedules per thread, with a program counter and stack per thread to track thread state, as well as a convergence optimizer to group active threads from the same warp into SIMT units. In effect, all threads are equally concurrent, regardless of which warp or SM executes them, and they can yield to one another and reconverge.
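A minimal sketch of that divergence-and-reconvergence behavior, in plain Python rather than anything resembling NVIDIA's actual scheduler:

```python
# Minimal SIMT sketch: a warp of 32 threads executes both sides of a
# branch under an active-lane mask, then reconverges. Pure-Python
# illustration of the concept, not NVIDIA's hardware scheduler.
WARP_SIZE = 32
data = list(range(WARP_SIZE))
mask = [x % 2 == 0 for x in data]   # per-thread branch condition

# The "if" side runs with only the even lanes active...
out = [x * 2 if m else None for x, m in zip(data, mask)]
# ...then the "else" side runs with the complement mask...
out = [x + 1 if not m else o for x, o, m in zip(data, out, mask)]
# ...and the warp reconverges with every lane holding a result.
all_lanes_done = all(o is not None for o in out)
```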
Furthermore, FMA (Fused Multiply-Add) operations are up to 50% faster on Turing, requiring only four clock cycles compared to six on Pascal.
Other than that, there’s one more obvious difference, namely the inclusion of the Tensor and RT Cores, specialized for mixed-precision compute and BVH traversal, respectively.
Shared Memory Architecture
The Turing SM also features a new shared memory architecture. With Pascal, the L1 cache (24 KB x 2) and the 96 KB shared memory for the Load/Store Units were separate. Turing combines the two into a single 96 KB block to make the whole system more flexible: it can be configured as 64 KB of L1 cache with 32 KB of shared memory, or the L1 can be reduced to 32 KB to boost the shared memory to 64 KB.
This unified design boosts overall bandwidth, reduces latency, and drastically improves the cache hit rate, in addition to simplifying resource allocation. As per NVIDIA’s in-house tests, the new shared memory architecture boosts performance per core by up to 50% in the case of repetitive workloads.
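The two carve-ups of the unified block can be summarized as follows (the sizes are Turing's documented configurations; the workload names are made up for illustration):

```python
# The Turing SM's unified 96 KB L1/shared-memory block can be carved up
# two ways; total capacity is fixed either way.
configs = {
    "graphics-leaning":  {"l1_kb": 64, "shared_kb": 32},
    "compute-leaning":   {"l1_kb": 32, "shared_kb": 64},
}

# Whichever split is chosen, the SM always exposes the same 96 KB.
totals = {name: c["l1_kb"] + c["shared_kb"] for name, c in configs.items()}
```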
The L2 cache size has also been doubled to complement the improved L1 cache.
The Tensor Cores and DLSS
Gaming workloads lean on FP32 compute, but AI and neural networks don’t require full precision and are often based on FP16, or half-precision, math. The Tensor cores exploit this principle to accelerate neural network training and inferencing.
In addition to FP16, the Turing Tensor cores support INT8 and INT4 precision modes. Although the precision of these two is much lower, they are well suited to inferencing, where larger deviations can be tolerated.
Armed with these low-precision modes, the Tensor cores speed up the matrix-matrix multiplications at the heart of neural network training and machine learning.
These help in gaming by enabling DLSS (Deep Learning Super Sampling). It works by rendering images on NVIDIA’s supercomputers at very high quality (64x supersampling) and comparing them with traditionally rendered frames. The neural network is then trained to make the normally rendered images match the 64xSS references as closely as possible.
This is a time-intensive task, and it takes the algorithm several days of training to produce matching images, but the result is quite impressive: the output images are close to the 64x supersampled references and don’t suffer from the blurring and loss of sharpness that often come with temporal anti-aliasing methods.
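The training objective itself is easy to sketch. In the toy below, a single "gain" parameter stands in for the network and flat lists stand in for images, so it is purely illustrative of the idea of fitting a renderer's output to a supersampled reference:

```python
# Toy version of the DLSS training objective: tune an "upscaler" so its
# output matches a high-quality reference. The single gain parameter and
# flat lists are stand-ins for a real network and real images.
reference = [0.8, 0.6, 0.9, 0.4]        # stands in for the 64xSS target
low_res_render = [0.4, 0.3, 0.45, 0.2]  # stands in for the normal render

def loss(gain):
    # mean squared error between "upscaled" output and the reference
    return sum((gain * p - r) ** 2
               for p, r in zip(low_res_render, reference)) / len(reference)

# Crude 1-D parameter search, standing in for SGD on a real network.
best_gain = min((g / 100 for g in range(0, 401)), key=loss)
```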
A Tensor core can perform 64 FMA (Fused Multiply-Add) operations using FP16 arithmetic per clock. The eight Tensor cores in an SM can therefore perform 512 such FP16 FMAs per clock. Using the INT8 precision mode, throughput doubles to 2,048 integer operations per clock. This helps accelerate neural network and deep learning training many times over.
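The arithmetic behind those per-clock figures, counting each FMA as two operations (a multiply and an add):

```python
# Per-clock Tensor-core throughput for one Turing SM, from the figures above.
TENSOR_CORES_PER_SM = 8
FP16_FMA_PER_CORE = 64                  # 64 FP16 FMAs per Tensor core per clock

fp16_fma_per_sm = TENSOR_CORES_PER_SM * FP16_FMA_PER_CORE  # 512 FMAs/clock
fp16_ops_per_sm = fp16_fma_per_sm * 2   # each FMA = multiply + add: 1,024 ops
int8_ops_per_sm = fp16_ops_per_sm * 2   # INT8 doubles throughput: 2,048 ops
```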
RT Cores and Ray Tracing (RTX)
The RT Cores (along with the Tensors) are the most obvious addition to the Turing architecture. They accelerate ray tracing by offloading its most intensive parts, BVH (Bounding Volume Hierarchy) traversal and ray-triangle intersection testing, from the shaders, which are left free to do their own thing while rasterization continues as normal.
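To get a feel for what the hardware accelerates, here is a deliberately simplified, one-dimensional BVH traversal in Python. Real RT Cores walk 3-D bounding boxes and run ray-triangle tests in fixed-function hardware; this sketch only shows the prune-or-descend logic:

```python
# Sketch of the kind of BVH traversal the RT Cores accelerate. 1-D toy:
# each node bounds its children with an interval, and the "ray" is reduced
# to a query point x. Real hardware tests 3-D boxes and triangles.
class Node:
    def __init__(self, lo, hi, children=(), prims=()):
        self.lo, self.hi = lo, hi   # bounding interval
        self.children = children    # inner nodes
        self.prims = prims          # leaf primitives (triangle IDs)

def traverse(node, x, hits):
    if not (node.lo <= x <= node.hi):
        return                      # ray misses the bounds: prune the subtree
    if node.prims:
        hits.extend(node.prims)     # leaf: ray-triangle tests would run here
    for child in node.children:
        traverse(child, x, hits)

# Two leaves under one root; only the right leaf bounds x = 7.
bvh = Node(0, 10, children=(Node(0, 4, prims=("triA",)),
                            Node(6, 10, prims=("triB", "triC"))))
hits = []
traverse(bvh, 7, hits)              # the left leaf is pruned entirely
```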
Compared to Pascal, the Turing architecture is up to 10x faster when it comes to ray-tracing performance. This boost comes from the addition of the RT Cores as well as the improved INT32 performance of Turing.
GDDR6 Memory, Faster L2 Cache, and Turing Memory Compression
NVIDIA’s GeForce RTX 20-series cards were the first to feature the new GDDR6 memory standard. With bandwidths in excess of 600 GB/s and memory speeds as high as 15 Gbps, it made HBM2 all but redundant in the consumer space.
GDDR6 is up to 20% more power-efficient than GDDR5X and has much more overclocking headroom. You’re looking at stock speeds as high as 14 Gbps and a 40% reduction in signal crosstalk, significantly lowering the chance of memory corruption.
Furthermore, Turing features improved memory compression, leading to an overall increase of roughly 50% in effective memory bandwidth over Pascal. About 20% of that comes from traffic reduction (compression); the rest is derived from the higher memory clocks.
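Rough numbers make the 50% figure plausible. The sketch below uses the 2080 Ti and 1080 Ti memory configurations (both on a 352-bit bus) and treats the ~20% compression gain as a flat multiplier, which is a simplification:

```python
# Where the "~50% more effective bandwidth than Pascal" figure comes from,
# illustrated with the 2080 Ti vs 1080 Ti memory configs.
pascal_gbps, turing_gbps = 11, 14       # GDDR5X vs GDDR6 per-pin data rates
bus_bits = 352                          # both Ti cards use a 352-bit bus

pascal_bw = pascal_gbps * bus_bits / 8  # 484 GB/s raw
turing_bw = turing_gbps * bus_bits / 8  # 616 GB/s raw

compression_gain = 1.20                 # ~20% less traffic from compression
effective_ratio = (turing_bw / pascal_bw) * compression_gain  # ~1.53x
```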
Lastly, the new Turing GPUs also include a faster and larger L2 cache to complement the quicker GDDR6 memory. Compared to the Pascal-based GP102, the Turing flagship TU102 packs 6 MB of L2 cache, twice as much as its predecessor.
Mesh Shaders and Task Shaders
With DirectX 11 and older GPU architectures, each object requires its own CPU draw call, and to top it off, the traditional geometry shader stages are largely single-threaded. This means that the various stages of the pipeline have to be executed sequentially rather than in parallel, even when the resources they require are independent.
Turing, with the help of DirectX 12, addresses this limitation using Task Shaders and Mesh Shaders. These two new shaders replace the cumbersome fixed shader stages of the DX11 geometry pipeline with a more flexible approach.
The mesh shader performs the same work as the domain and geometry shaders, but internally it uses a multi-threaded rather than a single-threaded model. The task shader works similarly: the major difference is that while the hull shader’s input was patches and its output the tessellated object, the task shader’s input and output are user-defined.
In a complex scene, there can be thousands of objects that need to be rendered. In the traditional model, each of them would require a unique draw call from the CPU. With the task shader, however, a list of objects is sent using a single draw call. The task shader then processes this list in parallel and assigns work to the mesh shader (which also runs multi-threaded), after which the scene is sent to the rasterizer for 3D-to-2D conversion.
This approach helps reduce the number of CPU draw calls per scene significantly, thereby increasing the level of detail.
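A toy model of where those savings come from; the batch size and object names below are made up, and no real graphics API is involved:

```python
# Toy model of the draw-call savings: the classic path issues one CPU draw
# call per object, while the task/mesh path submits one object list per
# call. Numbers and names are illustrative only.
objects = [f"rock_{i}" for i in range(10_000)]

classic_draw_calls = len(objects)       # one call per object: 10,000

BATCH = 1_000                           # objects per task-shader list
batches = [objects[i:i + BATCH] for i in range(0, len(objects), BATCH)]
task_shader_draw_calls = len(batches)   # one call per list: 10
```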
Variable Rate Shading
Variable Rate Shading is a technique that lets developers improve in-game performance without a notable impact on visual fidelity. This is done by lowering the shading rate in regions of the frame where full-rate shading would be wasted.
The scene is divided into various regions depending on a number of factors, such as frame-to-frame variation, motion, and the type of game, and the shading rate is varied for each region. In a racing game, for example, the best approach would be to shade the car (the player’s focus) every pixel, while the area near the car could be shaded once per four pixels, and the road to the left and right once per eight pixels.
The outer regions don’t change much, frame to frame, and the player is mostly focused on the center of the road (and the car). As a result, reducing the shading rate of the outer areas drastically improves performance without having an apparent impact on visual quality. There are three types of VRS:
- Content Adaptive Shading
- Reduces shading rate in regions of slowly changing color.
- Motion Adaptive Shading
- Variably decreases shading rate of moving objects.
- Foveated Rendering
- Reduces shading rate in areas away from the viewer’s focus.
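Back-of-the-envelope numbers for the three-region racing example above (the region sizes here are invented purely for illustration):

```python
# Rough shading-cost model for a frame split into full-rate, 1-per-4 and
# 1-per-8 regions. Region sizes are made up for illustration.
regions = {
    "car (full rate)":     {"pixels": 400_000,   "rate": 1},
    "near car (1 per 4)":  {"pixels": 600_000,   "rate": 4},
    "roadsides (1 per 8)": {"pixels": 1_000_000, "rate": 8},
}

full_cost = sum(r["pixels"] for r in regions.values())             # 2.0M shades
vrs_cost = sum(r["pixels"] // r["rate"] for r in regions.values()) # 675K shades
saving = 1 - vrs_cost / full_cost      # ~66% fewer shading operations
```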
HEVC Encode, NVLink and Virtual Link
For streamers, Turing had a big surprise: the new video encoder allows 4K streaming while exceeding the quality of the x264 software encoder. 8K 30 FPS HDR encode support is another sweet addition.
Two more features that come with Turing are VirtualLink and NVLink SLI. The former combines the different cables needed to connect your GPU to a VR headset into a single connector, while the latter improves SLI performance by leveraging the high bandwidth of the NVLink interface.
VirtualLink supports four lanes of High Bit Rate 3 (HBR3) DisplayPort along with a SuperSpeed USB 3 link to the headset for motion tracking. In comparison, standard USB-C supports either four lanes of HBR3 DisplayPort, or two lanes of HBR3 DisplayPort plus two lanes of SuperSpeed USB 3.
NVLink is another attempt to resurrect a dying technology. AFR-based multi-GPU rendering rarely works well, and simply increasing the bandwidth won’t fix that. In today’s games, temporal filtering is essential, and the hard fact is that it’s just not compatible with AFR (Alternate Frame Rendering). Until a new rendering model for SLI is adopted (DirectX 12’s explicit multi-adapter, perhaps?), multi-GPU systems remain a hard sell.