NVIDIA RTX “Turing” GPU Architectural Analysis: How RTX (Ray-Tracing) was Turned On

In terms of raw performance, NVIDIA’s RTX 20 series, built on the Turing architecture, offers a relatively small boost over the preceding GTX 10-series Pascal cards. Instead, most of NVIDIA’s R&D and marketing budget for Turing went into implementing real-time ray-tracing, marketed as RTX, along with AI-based upscaling in the form of DLSS.

Although the Turing architecture includes notable improvements to the graphics core, the inclusion of the new RT Cores along with the Tensor Cores is what really stands out.

The TU102: Exploring the Turing Flagship

When a new generation of graphics cards is planned, there are mainly just two or three GPU dies in the pipeline, dubbed Gx102, Gx104, etc. The fully enabled die is used in the flagship products and its cut-down variants form the consumer lineup. For the Pascal family (GTX 10 series), this was the GP102, powering the GTX 1080 Ti and the Titan X/Xp. The chief GPU of the Turing lineup is the TU102. Here are its specs:

  • CUDA Cores: 4,608
  • RT Cores: 72
  • Tensor Cores: 576
  • Texture Units: 288
  • Memory Config: 384-bit (32-bit x 12)
  • ROPs: 96
  • L2 Cache: 6144 KB (512KB x 12)
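
The totals above can be broken down per SM and per memory controller. The following is a minimal sketch of that arithmetic; it assumes one RT Core per SM (so the RT Core count doubles as the SM count) and an even split of units, and the helper names are made up for illustration.

```python
# Full TU102 figures as listed above.
TU102 = {
    "cuda_cores": 4608,
    "rt_cores": 72,
    "tensor_cores": 576,
    "texture_units": 288,
    "rops": 96,
    "memory_controllers": 12,  # 32-bit each, 384-bit in total
    "l2_kb": 6144,
}

# Assumption: one RT Core per SM, so the RT Core count equals the SM count.
SM_COUNT = TU102["rt_cores"]


def per_sm(spec, sm_count):
    """Split the SM-resident units evenly across all SMs."""
    keys = ("cuda_cores", "tensor_cores", "texture_units")
    return {key: spec[key] // sm_count for key in keys}


def per_controller(spec):
    """Split the memory-side resources evenly across the memory controllers."""
    count = spec["memory_controllers"]
    return {
        "bus_width_bits": count * 32,
        "rops": spec["rops"] // count,
        "l2_kb": spec["l2_kb"] // count,
    }


print(per_sm(TU102, SM_COUNT))
print(per_controller(TU102))
```

Under these assumptions each SM carries 64 CUDA cores, 8 Tensor Cores and 4 texture units, while each 32-bit memory controller is paired with 8 ROPs and 512 KB of L2.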

Like previous generations, it’s the Titan that includes the full-scale x102 die. The x80 Ti is a cut-down variant that sells for a relatively lower price, serving enthusiast gamers. In the RTX 20 series, that would be the GeForce RTX 2080 Ti. Although it’s not exactly cheap at $1,000, it doesn’t feature the entire TU102 die. That honor goes to the Titan RTX.

Like the 1080 Ti, the 2080 Ti is a cut-down part: it loses four SMs along with their standard shaders, Tensor Cores, and RT Cores. One memory controller is also disabled, along with eight render output units. You can compare the cut-down TU102 (used in the Ti) to the unmodified die employed in the Quadro RTX 6000 in the specs tables above and below.
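
To see what those cuts add up to, here is a rough sketch that rebuilds the 2080 Ti configuration from the full die. The per-SM and per-controller ratios are assumptions obtained by dividing the full TU102 figures evenly (72 SMs, 12 controllers); the `Die` class and `specs` helper are hypothetical names.

```python
from dataclasses import dataclass

# Assumed per-unit ratios, derived by dividing the full TU102 totals evenly.
PER_SM = {"cuda_cores": 64, "tensor_cores": 8, "rt_cores": 1}
PER_CONTROLLER = {"bus_width_bits": 32, "rops": 8, "l2_kb": 512}


@dataclass(frozen=True)
class Die:
    sms: int
    memory_controllers: int


def specs(die):
    """Scale the per-unit ratios by the number of enabled units."""
    result = {key: value * die.sms for key, value in PER_SM.items()}
    result.update(
        {key: value * die.memory_controllers for key, value in PER_CONTROLLER.items()}
    )
    return result


full_tu102 = Die(sms=72, memory_controllers=12)
# The 2080 Ti drops four SMs and one memory controller.
rtx_2080_ti = Die(sms=full_tu102.sms - 4,
                  memory_controllers=full_tu102.memory_controllers - 1)

print(specs(rtx_2080_ti))
```

With four SMs and one controller disabled, the sketch yields 4,352 CUDA cores, 544 Tensor Cores, 68 RT Cores, a 352-bit bus, 88 ROPs and 5,632 KB of L2.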

The Turing SM (Streaming Multiprocessor)

With every new GPU microarchitecture, NVIDIA redesigns (or rather rearranges) its SM or Streaming Multiprocessor. The core count per block is changed, the arrangement of the Load/Store Units and the Special Function Units is modified and at times the cache configuration is overhauled. This time all three have undergone a makeover:

Unlike the Big Pascal SM, which splits its 64 CUDA cores into two blocks, the Turing SM partitions its 64 CUDA cores into four clusters, and this holds for both Turing dies. In turn, each core cluster has three distinct core groups: FP32, INT32, and Tensor.

Concurrent INT and FP Execution

With Pascal, you could only run FP32 or INT32 instructions in a cycle, not both. This reduced instruction-level parallelism and caused delays with programs that consisted of both kinds of instructions. Turing has separate FP32 and INT32 cores, allowing simultaneous execution of integer and floating-point workloads. This is NVIDIA’s version of Asynchronous Compute. The implementation and function are different, but the purpose is the same: to maximize GPU utilization.

NVIDIA claims that this concurrent execution improves performance by up to 36%.
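
A toy issue-slot model shows where a figure of that size can come from. It is only a sketch: it assumes perfectly independent instructions, and the mix of 36 integer instructions per 100 floating-point instructions is an illustrative assumption chosen because it reproduces the quoted 36%.

```python
def cycles_shared_pipe(fp_instr, int_instr):
    """Pascal-style model: FP32 and INT32 work compete for the same issue slots."""
    return fp_instr + int_instr


def cycles_separate_pipes(fp_instr, int_instr):
    """Turing-style model: both kinds run side by side, so the longer queue dominates."""
    return max(fp_instr, int_instr)


def speedup(fp_instr, int_instr):
    return cycles_shared_pipe(fp_instr, int_instr) / cycles_separate_pipes(fp_instr, int_instr)


# Illustrative mix: 36 integer instructions for every 100 floating-point ones.
print(speedup(100, 36))   # 136 cycles shrink to 100
print(speedup(100, 0))    # pure FP32 code gains nothing
```

The model also makes the "up to" explicit: the gain depends entirely on how much integer work a shader mixes in with its floating-point math.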

[Figure: Turing SM]

However, on the downside, each core cluster gets one dispatch unit instead of two. This means that Turing can’t issue a second, independent instruction from the same thread in a single cycle. The drawback is minimized by the way Turing handles instruction scheduling: it takes two clock cycles to execute one set of instructions, but in the meantime the scheduler can issue another independent instruction (FP32 or INT32) every cycle. Turing therefore retains Pascal’s two-instructions-per-cycle capability while having twice as many sub-SMs.
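
The effect of that scheduling can be sketched with a very small simulation. It is a simplification under stated assumptions: one issue per cycle, each instruction occupying its pipe (FP32 or INT32) for two cycles, and every instruction independent of the others. The function name is hypothetical.

```python
def cycles_to_finish(instructions, busy_cycles=2):
    """In-order issue, at most one instruction per cycle.

    Each instruction is "FP" or "INT" and keeps its pipe busy for
    `busy_cycles` cycles; issue stalls while the target pipe is busy.
    """
    free_at = {"FP": 0, "INT": 0}   # cycle at which each pipe becomes free
    cycle = 0
    for kind in instructions:
        cycle = max(cycle, free_at[kind])      # wait for the pipe if needed
        free_at[kind] = cycle + busy_cycles    # pipe is occupied from now on
        cycle += 1                             # the next issue slot is one cycle later
    return max(free_at.values())


mixed = ["FP", "INT"] * 4     # alternating instruction stream
fp_only = ["FP"] * 8          # same length, single pipe

print(cycles_to_finish(mixed))    # both pipes overlap
print(cycles_to_finish(fp_only))  # one pipe, no overlap
```

In this model the alternating stream finishes in 9 cycles against 16 for the FP-only stream, which is the sense in which a single dispatch unit per cluster can still keep two pipes busy.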

[Figure: Big Pascal]

Another major change is with respect to thread scheduling. As you know, NVIDIA uses an execution model dubbed SIMT (Single Instruction, Multiple Threads). Although not exactly SIMD in nature, the scheduling works on a per-warp basis, with each warp consisting of 32 threads. One individual Turing SM runs up to 32 warps, compared to 64 on Pascal and Volta.
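
As a quick illustration of what the warp limit means for a kernel launch, the sketch below converts it into resident threads and thread blocks per SM. It deliberately ignores every other occupancy limit (registers, shared memory, per-SM block caps), and the helper names are made up.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = {"Turing": 32, "Pascal": 64, "Volta": 64}


def warps_per_block(threads_per_block):
    """A block always occupies whole warps, so round up."""
    return -(-threads_per_block // WARP_SIZE)


def max_resident_threads(arch):
    return MAX_WARPS_PER_SM[arch] * WARP_SIZE


def max_resident_blocks(arch, threads_per_block):
    """Blocks that fit on one SM when the warp limit is the only constraint."""
    return MAX_WARPS_PER_SM[arch] // warps_per_block(threads_per_block)


for arch in MAX_WARPS_PER_SM:
    print(arch, max_resident_threads(arch), max_resident_blocks(arch, 256))
```

So a Turing SM tops out at 1,024 resident threads, half of what a Pascal or Volta SM can hold.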

However, like Volta, Turing also has independent thread scheduling. While Pascal and Maxwell scheduled threads per warp, Turing schedules per thread, with a program counter and call stack per thread to track thread state, as well as a convergence optimizer that groups threads from the same warp into SIMT units. Basically, all threads are equally concurrent, regardless of which warp or SM is executing them, and they can yield and reconverge.
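
The idea of per-thread program counters plus a convergence step can be mimicked in a few lines. This is a toy model, not the hardware's actual policy: the five-instruction program is invented, and "run the group with the lowest program counter first" is an assumed scheduling rule picked for simplicity.

```python
from collections import defaultdict

# Hypothetical program: pc -> (instruction name, function giving the next pc).
PROGRAM = {
    0: ("branch", lambda tid: 1 if tid % 2 == 0 else 3),  # even/odd threads diverge
    1: ("A", lambda tid: 2),
    2: ("B", lambda tid: 4),
    3: ("C", lambda tid: 4),
    4: ("Z", lambda tid: None),  # None marks the end of the thread
}


def run_warp(num_threads=4):
    """Each thread keeps its own pc; threads sharing a pc execute together."""
    pcs = {tid: 0 for tid in range(num_threads)}
    trace = []
    while pcs:
        groups = defaultdict(list)
        for tid, pc in pcs.items():
            groups[pc].append(tid)
        pc = min(groups)                      # assumed policy: lowest pc first
        name, next_pc = PROGRAM[pc]
        tids = sorted(groups[pc])
        trace.append((name, tids))
        for tid in tids:
            target = next_pc(tid)
            if target is None:
                del pcs[tid]
            else:
                pcs[tid] = target
    return trace


for step in run_warp():
    print(step)
```

The trace shows the threads splitting after the branch, running their own paths in separate groups, and then executing the final instruction together again once their program counters match.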

[Figure: Pascal SM]

Furthermore, FMA (Fused Multiply-Add) operations are up to 50% faster on Turing, requiring only four clock cycles compared to six cycles on Pascal.

Other than that, there’s another obvious difference, namely the inclusion of the Tensor Cores and the RT Cores, specialized for mixed-precision compute and BVH traversal, respectively.

Shared Memory Architecture

The Turing SM also features a new shared memory architecture. With Pascal, the L1 cache (24 KB x2) and the shared memory (96 KB) served by the Load/Store Units were separate. Turing combines the two into a single block to make the whole system more flexible. The L1 cache can be configured as 64 KB when a workload benefits from more cache, or reduced to 32 KB to give the shared memory a larger portion.
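
The two splits can be written down as a small lookup. This sketch assumes a 96 KB unified pool per SM (64 KB + 32 KB either way round) and uses an invented selection rule, "largest L1 that still satisfies the kernel's shared memory request"; real drivers apply their own heuristics.

```python
UNIFIED_POOL_KB = 96  # assumption: total of L1 cache plus shared memory per SM

CONFIGS = [
    {"l1_kb": 64, "shared_kb": 32},
    {"l1_kb": 32, "shared_kb": 64},
]


def pick_config(shared_needed_kb):
    """Return the split with the largest L1 that still fits the request."""
    candidates = [c for c in CONFIGS if c["shared_kb"] >= shared_needed_kb]
    if not candidates:
        raise ValueError(f"{shared_needed_kb} KB of shared memory does not fit")
    return max(candidates, key=lambda c: c["l1_kb"])


print(pick_config(16))  # small request: keep the larger L1
print(pick_config(48))  # larger request: shrink L1 to 32 KB
```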

This shared memory architecture boosts the overall bandwidth, reduces the latency, and drastically improves the cache hit rate, in addition to simplifying resource allocation. As per NVIDIA’s in-house tests, the new unified cache and shared memory architecture boosts performance per core by up to 50% in the case of repetitive workloads.

The L2 cache size has also been doubled to complement the improved L1 cache.

The Tensor Cores and DLSS

Gaming workloads leverage FP32 compute, but AI and neural networks don’t require full precision and often rely on FP16 (half-precision) compute. The Tensor Cores exploit this to speed up inferencing and neural network training.

In addition to FP16, the Turing Tensor Cores support INT8 and INT4 precision modes. Although the precision of these two is much lower, they are suited for inferencing, where larger errors can be tolerated.

Armed with these low-precision modes, the Tensor Cores speed up the matrix-matrix multiplications required in neural network training and machine learning.
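
What "low-precision inputs, matrix multiply" means in practice can be sketched in plain Python. The example below rounds the inputs to FP16 using the standard library's half-precision format and accumulates the products at higher precision. The 4x4 tile size is an assumption, and Python's double-precision floats stand in for the hardware accumulator.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def tensor_fma(a, b, c):
    """Compute D = A x B + C with A and B rounded to FP16 first.

    Products are accumulated in Python floats, standing in for the
    wider accumulator used with half-precision inputs.
    """
    n = len(a)
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    return [
        [c[i][j] + sum(a16[i][k] * b16[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
tenths = [[0.1] * 4 for _ in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]

print(to_fp16(0.1))                           # 0.1 is not exactly representable in FP16
print(tensor_fma(identity, tenths, zeros)[0]) # the rounding shows up in the result
```

A 4x4 by 4x4 product needs 4 x 4 x 4 = 64 multiply-adds, and the rounding error visible in the output is the precision that gets traded away for throughput.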

These help in gaming by enabling DLSS (Deep Learning Super Sampling). It works by rendering images on NVIDIA’s supercomputers at a very high quality (64x Super Sampling) alongside traditionally rendered versions of the same frames. A neural network then compares the two and is trained to make the normally rendered images match the 64x supersampled ones as closely as possible.

This is a time-intensive task, and it takes the algorithm several days to produce matching images, but the result is quite impressive. The output images are similar to the 64x supersampled images and don’t suffer from the blurring and loss of sharpness that often come with temporal anti-aliasing methods.
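
To make the 64x supersampled target concrete, here is a small sketch that renders the same pixel with one sample and with an 8x8 grid of 64 samples. It shows only how the reference image differs from a normal render, not the neural network itself; the `scene` function is a made-up test pattern with a hard diagonal edge.

```python
def scene(x, y):
    """Hypothetical scene: white (1.0) where y < x, black (0.0) elsewhere."""
    return 1.0 if y < x else 0.0


def render_pixel(px, py, samples_per_axis):
    """Average a regular grid of samples inside the pixel at (px, py)."""
    total = 0.0
    for i in range(samples_per_axis):
        for j in range(samples_per_axis):
            x = px + (i + 0.5) / samples_per_axis
            y = py + (j + 0.5) / samples_per_axis
            total += scene(x, y)
    return total / samples_per_axis ** 2


def render(width, height, samples_per_axis):
    return [
        [render_pixel(px, py, samples_per_axis) for px in range(width)]
        for py in range(height)
    ]


normal = render(4, 4, 1)      # one sample per pixel
reference = render(4, 4, 8)   # 8 x 8 = 64 samples per pixel

print(normal[0][0], reference[0][0])  # a pixel sitting on the edge
```

The single-sample render snaps the edge pixel to pure black, while the 64-sample reference records partial coverage. That smoother result is the kind of target the network is trained to reproduce from the cheaper input.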

NVIDIA claims that DLSS lets the RTX 2080 Ti perform 2x faster than the GTX 1080 Ti while producing almost identical quality.

In DLSS 2x, the input is rendered at the target resolution and then processed by a larger DLSS network to produce an output image that approaches the level of 64x SS.

A Tensor Core can perform 64 FMA (Fused Multiply-Add) operations per clock using FP16 arithmetic. The Tensor Cores in an SM can together perform 512 such FP16 FMAs per clock, or 1,024 floating-point operations, since each FMA counts as a multiply and an add. Using the INT8 precision mode, that figure doubles to 2,048 integer operations per clock. This helps speed up neural network and deep learning workloads several times over.
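
The per-clock figures follow from simple multiplication, sketched below. The count of eight Tensor Cores per SM is an assumption taken from dividing TU102's 576 Tensor Cores across 72 SMs, and an FMA is counted as two operations (one multiply, one add).

```python
FMA_PER_TENSOR_CORE = 64   # FP16 fused multiply-adds per clock, as quoted above
TENSOR_CORES_PER_SM = 8    # assumption: 576 Tensor Cores / 72 SMs


def fp16_fma_per_sm():
    return FMA_PER_TENSOR_CORE * TENSOR_CORES_PER_SM


def fp16_ops_per_sm():
    """Each FMA is one multiply plus one add."""
    return 2 * fp16_fma_per_sm()


def int8_ops_per_sm():
    """INT8 mode runs at twice the FP16 rate."""
    return 2 * fp16_ops_per_sm()


print(fp16_fma_per_sm(), fp16_ops_per_sm(), int8_ops_per_sm())
```

Read this way, the 2,048 INT8 figure is twice the 1,024 FP16 operations per clock, not twice the 512 FMAs.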

RT Cores and Ray Tracing (RTX)

The RT Cores (along with the Tensor Cores) are the most obvious addition to the Turing architecture. They boost ray-tracing by offloading the most intensive parts of the process from the shaders while rasterization continues as normal. Specifically, the RT Cores accelerate BVH (Bounding Volume Hierarchy) traversal and ray-triangle intersection testing, leaving the shaders free to do their own work. For a more detailed picture of how ray-tracing is accelerated by the RT Cores, read the following piece:

What is Ray-Tracing and RTX: How it All Works
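
For a feel of the two operations the RT Cores take over, here is a minimal software sketch of BVH traversal and ray-triangle testing. It uses the textbook slab test for boxes and the Möller-Trumbore test for triangles; these are stand-ins chosen for illustration, not a description of the hardware, and the `Node` layout is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec = Tuple[float, float, float]
Triangle = Tuple[Vec, Vec, Vec]


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


@dataclass
class Node:
    lo: Vec                                      # bounding box minimum corner
    hi: Vec                                      # bounding box maximum corner
    children: Optional[List["Node"]] = None      # set on inner nodes
    triangles: Optional[List[Triangle]] = None   # set on leaves


def hits_box(origin, direction, lo, hi):
    """Slab test: does the ray pass through the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if abs(d) < 1e-12:               # ray parallel to this pair of planes
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def hit_triangle(origin, direction, tri, eps=1e-9):
    """Möller-Trumbore test: distance along the ray, or None on a miss."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:                   # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None


def trace(node, origin, direction):
    """Walk the BVH and return the nearest hit distance, or None."""
    if not hits_box(origin, direction, node.lo, node.hi):
        return None                      # the whole subtree is skipped
    hits = []
    if node.triangles is not None:
        for tri in node.triangles:
            t = hit_triangle(origin, direction, tri)
            if t is not None:
                hits.append(t)
    else:
        for child in node.children:
            t = trace(child, origin, direction)
            if t is not None:
                hits.append(t)
    return min(hits, default=None)


near = Node((-1, -1, 5), (1, 1, 5), triangles=[((-1, -1, 5), (1, -1, 5), (0, 1, 5))])
far = Node((9, -1, 5), (11, 1, 5), triangles=[((9, -1, 5), (11, -1, 5), (10, 1, 5))])
root = Node((-1, -1, 5), (11, 1, 5), children=[near, far])

print(trace(root, (0, 0, 0), (0, 0, 1)))  # hits the near triangle
print(trace(root, (0, 0, 0), (0, 1, 0)))  # misses the root box entirely
```

The point of the hierarchy is the early return in `trace`: a ray that misses a node's box never tests any triangle underneath it, which is the work the shaders no longer have to do.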

Compared to Pascal, the Turing architecture is up to 10x faster when it comes to ray-tracing performance. This boost comes from the addition of the RT Cores as well as the improved INT32 performance of Turing.

GDDR6 Memory, Faster L2 Cache, and Turing Memory Compression

NVIDIA’s GeForce RTX 20 series cards were the first to feature the new GDDR6 memory standard. With a bandwidth in excess of 600GB/s and memory speeds as high as 15 Gbps, it made HBM 2 obsolete in the consumer space.

GDDR6 is up to 20% more power-efficient than GDDR5X, with much higher overclocking headroom. You’re looking at stock data rates as high as 14 Gbps and a 40% reduction in signal crosstalk, which significantly lowers the risk of memory errors.
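
The bandwidth figure follows directly from the per-pin data rate and the bus width. A minimal sketch of that arithmetic, using the 14 Gbps rate together with the 352-bit bus of the RTX 2080 Ti (eleven 32-bit controllers) and the 384-bit bus of the full TU102:

```python
def bandwidth_gb_s(data_rate_gbps, bus_width_bits):
    """Peak bandwidth in GB/s: per-pin rate times bus width, divided by 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8


print(bandwidth_gb_s(14, 352))  # RTX 2080 Ti: 11 x 32-bit controllers
print(bandwidth_gb_s(14, 384))  # full TU102: 12 x 32-bit controllers
```

That gives 616 GB/s for the 2080 Ti and 672 GB/s for the fully enabled die, in line with the "in excess of 600GB/s" claim.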

Furthermore, Turing features advanced memory compression, leading to an overall increase of 50% in effective memory bandwidth over Pascal. About 20% comes from traffic reduction (compression), while the rest is derived from the higher memory clocks.
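
The two contributions multiply rather than add, which a short calculation makes clear. This sketch assumes 11 Gbps GDDR5X on the Pascal side (the GTX 1080 Ti) against 14 Gbps GDDR6, and treats the roughly 20% traffic reduction as a flat 1.2x factor.

```python
def effective_bandwidth_gain(old_rate_gbps, new_rate_gbps, traffic_gain):
    """Combine the raw clock gain with the compression gain (both as multipliers)."""
    return (new_rate_gbps / old_rate_gbps) * traffic_gain


# Assumed inputs: 11 Gbps GDDR5X vs 14 Gbps GDDR6, ~20% from traffic reduction.
gain = effective_bandwidth_gain(11, 14, 1.2)
print(round(gain, 2))
```

The result lands at roughly 1.5x, matching the quoted 50% improvement.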

Lastly, the new Turing GPUs also include a faster and larger L2 cache to complement the faster GDDR6 memory. The Turing flagship, TU102, packs 6 MB of L2 cache, twice as much as its Pascal-based predecessor, the GP102. This especially helps in games that leverage tiled rendering, allowing the GPU to store more data on-chip and reducing both the power draw and the bandwidth requirement.

Mesh Shading

With DirectX 11 and older GPU architectures, each object had its own CPU draw call and, to top it off, the shader model worked on a per-thread basis. This means that the various stages of the pipeline had to be executed sequentially rather than in parallel, even if the resources required were independent.

Turing, with the help of DirectX 12, fixes this limitation using Task Shaders and Mesh Shaders. These two new shaders replace the various cumbersome shader stages of the DX11 pipeline with a more flexible approach. Refer to Pipeline State Objects in the following for more:

What is the Difference Between DirectX 11 vs DirectX 12: In-depth Analysis

The mesh shader performs the same task as the domain and geometry shaders, but internally it uses a multi-threaded model instead of a single-threaded one. The task shader works similarly, taking over the role of the hull shader. The major difference here is that while the hull shader’s input was patches and its output the tessellated object, the task shader’s input and output are user-defined.

In the above scene, there are thousands of objects that need to be rendered. In the traditional model, each of them would require a unique draw call from the CPU. With the task shader, however, the whole list of objects is sent using a single draw call. The task shader then processes this list in parallel and assigns work to the mesh shader (which also works in parallel), after which the scene is sent to the rasterizer for 3D-to-2D conversion.

This approach helps reduce the number of CPU draw calls per scene significantly, thereby increasing the level of detail.
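
The difference in CPU-side work can be sketched with a toy model. It is not the DirectX 12 API: CPU threads stand in for GPU thread groups, and `task_stage`, `mesh_stage` and the batch size of 32 are hypothetical names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def traditional_draw_calls(objects):
    """Classic model: the CPU issues one draw call per object."""
    return len(objects)


def task_stage(objects, batch_size):
    """Split the submitted object list into work groups for the mesh stage."""
    return [objects[i:i + batch_size] for i in range(0, len(objects), batch_size)]


def mesh_stage(batch):
    """Turn one work group into primitives (represented here by strings)."""
    return [f"primitives({name})" for name in batch]


def render_with_mesh_shading(objects, batch_size=32):
    draw_calls = 1  # the whole list travels in a single draw call
    batches = task_stage(objects, batch_size)
    with ThreadPoolExecutor() as pool:          # batches are processed in parallel
        results = list(pool.map(mesh_stage, batches))
    primitives = [p for batch in results for p in batch]
    return draw_calls, primitives


objects = [f"rock_{i}" for i in range(1000)]
calls, primitives = render_with_mesh_shading(objects)
print(traditional_draw_calls(objects), calls, len(primitives))
```

The same thousand objects reach the rasterizer either way; what changes is that the CPU submits them once instead of a thousand times.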

Variable Rate Shading

Variable Rate Shading (VRS) is a technique that lets developers improve in-game performance without a notable impact on visual fidelity. This is done by lowering the shading rate in parts of the frame, so that a single shading result covers several pixels.

The scene is divided into various regions, depending on a number of factors such as frame-to-frame variation, motion, type of game, etc., and the shading rate is varied for each region. For example, in the above image, the best approach would be to shade the car (blue region) every pixel, while the area near the car could be shaded once per four pixels (green), and the road to the left and right could be shaded once per eight pixels (yellow).

VRS allows developers to spend more resources on the more “visible” parts of the scene. The less noticeable regions (mainly the peripheries) get reduced execution time, thereby increasing quality without any apparent reduction in performance.

The outer regions don’t change much, frame to frame, and the player is mostly focused on the center of the road (and the car). As a result, reducing the shading rate of the outer areas drastically improves performance without having an apparent impact on visual quality. There are three types of VRS:

  • Content Adaptive Shading: Reduces shading rate in regions of slowly changing color.
  • Motion Adaptive Shading: Variably decreases shading rate of moving objects.
  • Foveated Rendering: Reduces shading rate in areas away from the viewer’s focus.
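
The saving from a car-and-road style split (every pixel, once per four pixels, once per eight pixels) can be estimated with a few lines. The region sizes below are invented for a 1920x1080 frame purely to illustrate the arithmetic.

```python
def shader_invocations(regions):
    """Sum shading work over regions given as (pixel_count, pixels_per_shade)."""
    return sum(pixels // pixels_per_shade for pixels, pixels_per_shade in regions)


# Hypothetical split of a 1920x1080 frame (2,073,600 pixels).
regions = [
    (273_600, 1),     # focus area: shaded every pixel
    (600_000, 4),     # surroundings: one shade per four pixels
    (1_200_000, 8),   # periphery: one shade per eight pixels
]

total_pixels = sum(pixels for pixels, _ in regions)
invocations = shader_invocations(regions)
print(invocations, round(1 - invocations / total_pixels, 3))
```

With this made-up split the frame needs 573,600 shader invocations instead of 2,073,600, a reduction of roughly 72% in shading work.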

HEVC Encode, NVLink and VirtualLink

For streamers, Turing had a big surprise. The Turing video encoder (NVENC) allows 4K streaming while exceeding the quality of the x264 encoder. 8K 30 FPS HDR support is another sweet addition.

Two more features that come with Turing are VirtualLink and NVLink SLI. The former combines the different cables needed to connect your GPU to a VR headset into one, while the latter improves SLI performance by leveraging the high bandwidth of the NVLink interface.

VirtualLink supports up to four lanes of High Bit Rate 3 (HBR3) DisplayPort along with the SuperSpeed USB 3 link to the headset for motion tracking. In comparison, USB-C only supports four lanes of HBR3 DisplayPort or two lanes of HBR3 DisplayPort + two lanes SuperSpeed USB 3.
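
The lane trade-off is easy to tabulate. The sketch below assumes the standard HBR3 line rate of 8.1 Gbps per lane with 8b/10b encoding; the `LINK_MODES` table simply restates the three configurations described above.

```python
HBR3_GBPS_PER_LANE = 8.1       # raw DisplayPort HBR3 line rate per lane
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b encoding overhead

LINK_MODES = {
    "VirtualLink": {"dp_lanes": 4, "usb3": True},
    "USB-C, 4-lane DP": {"dp_lanes": 4, "usb3": False},
    "USB-C, 2-lane DP + USB 3": {"dp_lanes": 2, "usb3": True},
}


def video_payload_gbps(dp_lanes):
    """Usable video bandwidth after encoding overhead."""
    return dp_lanes * HBR3_GBPS_PER_LANE * ENCODING_EFFICIENCY


for name, mode in LINK_MODES.items():
    print(name, round(video_payload_gbps(mode["dp_lanes"]), 2), mode["usb3"])

full_video_with_usb3 = [
    name for name, mode in LINK_MODES.items()
    if mode["dp_lanes"] == 4 and mode["usb3"]
]
print(full_video_with_usb3)
```

Only the VirtualLink mode keeps all four DisplayPort lanes and the USB 3 link at the same time, which is what a high-resolution headset with motion tracking needs over a single cable.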

NVLink is another attempt to resurrect a dying technology. AFR-based (Alternate Frame Rendering) multi-GPU rendering rarely works, and just increasing the bandwidth won’t fix that. In today’s games, temporal filtering is essential, and the hard fact is that it’s just not compatible with AFR. Until a new rendering technology for SLI is developed (DirectX 12’s, perhaps?), multi-GPU systems are a moot point.

Areej
