NVIDIA has provided more info about the Ampere architecture powering the GeForce RTX 3070, 3080 and 3090. We now know a lot more about the Ampere SM and the FP32, INT32, Tensors and RT Cores within. Let’s start with the new Ampere SM:
Ampere Streaming Multiprocessor (SM): A Step Back in the Right Direction
The most interesting part of the new RTX Ampere GPUs is the floating-point (FP32) performance. NVIDIA claims a 2x increase in FP32 performance per SM with Ampere, thanks to a 2x increase in the FP32 core count, from 64 to 128 per SM. This, however, comes at a cost. While Turing supported concurrent INT and FP instructions in a 1:1 ratio, Ampere takes a step back and uses a common pipeline (CUDA core cluster) that handles both integer and floating-point workloads.
Update: Here are some detailed diagrams of the Turing and Ampere SMs, courtesy of Hiroshige Goto from PC Watch.
Every FP32 CUDA core cluster in an SM partition is a SIMD16 unit that takes two cycles to resolve a 32-thread warp (thread group), just like Turing. Unlike Turing, one set of cores is dedicated to FP32 workloads while the other can process either a warp of INT or a warp of FP threads every two cycles.
As far as the Tensor cores are concerned, the earlier 2nd Gen Tensors with Turing were 64-lane wide with INT4/INT8/FP16 support. The 3rd Gen Tensor Cores with Ampere are twice as wide with 128 lanes and support for sparsity further improves overall mixed precision performance.
Each of the four partitions in an SM has two datapaths or pipelines: one with a cluster of 16 CUDA cores dedicated purely to FP32 operations, and another 16-core cluster that can do either FP32 or INT32. This means that the 2x FP32 figure, or 128 FMA per clock per SM, that NVIDIA is touting will only hold when the workload is composed purely of FP32 instructions, which is rarely the case. This is why we don’t see a 2x increase in performance even though the core count doubles.
This also means that when the workload contains INT32 instructions as well, which mostly come from compute work, some of the FP32/INT32 cores will be used for them, reducing the peak FP32 throughput. As I said, the 128 FMA per SM figure is an impractical best-case scenario, and for the most part you’ll get around 75 to 90 FMA per clock. This is, however, still a notable step up over Turing, since integer instructions are much less frequent than FP32 ones, and as such shader utilization should be notably better with this configuration.
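To put rough numbers on this, here is a toy throughput model in Python. It is only a sketch under stated assumptions, not a description of NVIDIA's scheduler: every instruction is either FP32 or INT32, both datapaths are always fed, and each datapath is 16 lanes wide. The function name `fp32_per_clock` is made up for this illustration.

```python
# Toy throughput model of one Ampere SM (an illustration, not NVIDIA's scheduler).
# Assumptions: the instruction stream is only FP32 or INT32, both datapaths
# are always kept busy, and each datapath is 16 lanes wide.
PARTITIONS_PER_SM = 4
LANES_PER_DATAPATH = 16  # one FP32-only and one FP32/INT32 datapath per partition


def fp32_per_clock(int_fraction: float) -> float:
    """Estimated FP32 operations per clock per SM for a given INT32 share."""
    if not 0.0 <= int_fraction <= 1.0:
        raise ValueError("int_fraction must be between 0 and 1")
    total_lanes = 2 * PARTITIONS_PER_SM * LANES_PER_DATAPATH  # 128 lanes
    shared_lanes = PARTITIONS_PER_SM * LANES_PER_DATAPATH     # 64 lanes can run INT32
    if int_fraction == 0.0:
        return float(total_lanes)
    # INT32 work can only use the shared lanes, so it caps the overall issue rate.
    ops_per_clock = min(total_lanes, shared_lanes / int_fraction)
    return ops_per_clock * (1.0 - int_fraction)


for share in (0.0, 0.3, 0.4, 0.5):
    print(f"INT32 share {share:.0%}: ~{fp32_per_clock(share):.1f} FP32 ops/clock")
```

In this simplified model, an INT32 share of 30 to 40 percent lands at roughly 77 to 90 FP32 operations per clock, which is in line with the estimate above. Real workloads also stall on memory and other instruction types, so treat these as upper bounds.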
To allow scheduling of both integer and floating-point workloads, the L1 cache bandwidth had to be doubled: 128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing. Total L1 bandwidth for the RTX 3080 is 219 GB/sec versus 116 GB/sec for RTX 2080 Super.
2nd Gen RT Cores
While the basic BVH acceleration and ray-triangle testing are unchanged, NVIDIA has added a component to the RT core that interpolates the triangle position before the ray-triangle intersection test.
The 2nd Gen RT cores introduce a rather interesting feature called motion blur acceleration. With the Turing RT cores, performing intersection tests with objects (triangles) that were in motion was difficult (and slower), as the applied motion blur made it harder to pinpoint which triangle was hit by the ray.
The added triangle interpolation unit helps with exactly that. It works in parallel with the intersection unit, interpolating the position of the triangle, which speeds up the process while also making it more accurate.
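The idea can be sketched in software. The example below is a minimal Python illustration, not NVIDIA's hardware algorithm: it assumes vertices move linearly over the frame and that every ray carries a timestamp `t` between 0 and 1. All function names are hypothetical, and the intersection test used is the standard Möller-Trumbore method.

```python
# Sketch of time-interpolated ray-triangle testing in plain Python.
# Assumptions: vertices move linearly during the frame and each ray has a
# timestamp t in [0, 1]. All names here are hypothetical.

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def lerp_triangle(tri_start, tri_end, t):
    """Blend each vertex between its start-of-frame and end-of-frame position."""
    return [tuple(a + (b - a) * t for a, b in zip(v0, v1))
            for v0, v1 in zip(tri_start, tri_end)]


def ray_hits_triangle(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore test: returns the hit distance, or None on a miss."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    dist = dot(e2, q) * inv
    return dist if dist > eps else None


# A triangle that slides two units along x during the frame.
tri_start = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0)]
tri_end = [(2.0, 0.0, 5.0), (3.0, 0.0, 5.0), (2.0, 1.0, 5.0)]
origin, direction = (0.25, 0.25, 0.0), (0.0, 0.0, 1.0)

for t in (0.0, 1.0):
    hit = ray_hits_triangle(origin, direction, lerp_triangle(tri_start, tri_end, t))
    print(f"ray time {t}: {'hit at ' + str(hit) if hit is not None else 'miss'}")
```

The same ray hits the triangle at the start of the frame and misses it at the end, which is why the triangle position has to be resolved per ray before the intersection test.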
An interesting thing to note here is that NVIDIA’s RT Cores accelerate BVH traversal and ray-box intersections as well as ray-triangle intersections, meaning they work independently of, and concurrently with, the primary shader cores. AMD’s RDNA 2, on the other hand, seems to use the shaders of the primary pipeline for BVH traversal, with only the intersection testing done by dedicated RT hardware. Furthermore, the RT hardware shares resources with the texture units, so only one of them can function at a time, further increasing the overhead of ray-tracing.
3rd Gen Tensor Cores
Now, we have the 3rd Gen Tensor cores. Although the count has decreased from eight to four per SM in Ampere, the performance has increased significantly. This has been done by borrowing the sparse deep learning technology from the A100 GPU accelerator. Essentially, it optimizes the weight matrix by removing the redundant zero weights and compressing what remains.
This significantly reduces the matrix size, thereby cutting the time needed to resolve it. Despite packing half as many Tensor cores, an RTX 3080 featuring the 3rd Gen Tensor cores is twice as fast as Turing in mixed-precision FP16 compute. Without the use of sparse matrices, the Tensor performance is unchanged.
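A small Python sketch shows what such a compression looks like. It assumes the 2:4 structured pattern used by the A100 (at most two non-zero weights in every group of four); the storage layout and the helper names `compress_2_4` and `sparse_dot` are made up for illustration and do not reflect the actual hardware format.

```python
# Sketch of 2:4 structured-sparsity compression (illustrative layout only).
# Assumption: each group of four weights holds at most two non-zero values.

def compress_2_4(row):
    """Store two values per group of four, plus their positions in the group."""
    if len(row) % 4:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        nonzero = [i for i, w in enumerate(group) if w != 0]
        if len(nonzero) > 2:
            raise ValueError("group is not 2:4 sparse")
        # Pad with zero-valued slots so every group stores exactly two entries.
        padding = [i for i in range(4) if i not in nonzero]
        keep = sorted(nonzero + padding[:2 - len(nonzero)])
        for i in keep:
            values.append(group[i])
            indices.append(i)
    return values, indices


def sparse_dot(values, indices, activations):
    """Dot product that only touches the stored half of the weights."""
    total = 0
    for k, (w, idx) in enumerate(zip(values, indices)):
        group = k // 2                      # two stored entries per group of four
        total += w * activations[group * 4 + idx]
    return total


weights = [0, 3, 0, -1, 2, 0, 0, 0]
activations = [1, 2, 3, 4, 5, 6, 7, 8]
values, indices = compress_2_4(weights)

dense = sum(w * a for w, a in zip(weights, activations))
print(len(weights), "weights stored as", len(values), "values")
print("dense result:", dense, "sparse result:", sparse_dot(values, indices, activations))
```

The compressed row holds half as many values and the dot product does half as many multiplications, which is where the claimed 2x comes from when the weights fit this pattern.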
Turing vs Ampere Render Time Per Frame
It’s also worth noting that every SM partition can run the RT, Tensor, and FP32/INT32 workloads in parallel, provided there is ample register/cache bandwidth available. Turing could run the FP32 and INT32 pipelines in parallel with either the RT or the Tensor cores, but not with both at once. Running the RT and Tensor pipelines concurrently reduces the render time of a frame with ray-tracing and DLSS by 0.8 ms. Overall, including the improved FP32 performance, the frame time is cut by nearly half: from 13 ms to 6.7 ms.
GA102 GPU: The Die Powering the RTX 3080 and 3090
The GA102 GPU powering the RTX 3080 and 3090 has 7 GPCs (Graphics Processing Clusters), each divided into six TPCs (Texture Processing Clusters), which in turn are each composed of two SMs. It has 6,144 kB of L2 cache on the 3090 and 5,120 kB on the 3080 (the TU102 had 6,144 kB), while the GA104 packs even less at 4,096 kB. The RTX 3080 has 6 active GPCs, with one GPC disabled. This results in an overall core count of 8,704 CUDA cores across 68 SMs. For the RTX 3090, (almost) the entire die is enabled (except two SMs), resulting in a total core count of 10,496 across 82 SMs.
Furthermore, two of the memory controllers in the RTX 3080 are disabled, reducing the bus width from 384-bit to 320-bit. Interestingly, in the case of the RTX 3080, the number of TPCs varies from GPC to GPC: two of the GPCs have only five TPCs instead of six. Even the “BFGPU”, the RTX 3090, has two SMs disabled (not shown in the above block diagram).
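The SM, core and bus-width figures follow from simple arithmetic, sketched below in Python. It assumes six TPCs per full GPC, two SMs per TPC, 128 FP32 cores per SM and 32 bits per memory controller (384 bits across 12 controllers); the helper `sm_count` is just a name chosen for this example.

```python
# Back-of-the-envelope check of the GA102 configurations described above.
SMS_PER_TPC = 2
FP32_CORES_PER_SM = 128
BITS_PER_MEMORY_CONTROLLER = 32  # assumption: 384-bit bus split over 12 controllers


def sm_count(tpcs_per_gpc, disabled_sms=0):
    """Number of active SMs given the TPC count of every active GPC."""
    return sum(tpcs_per_gpc) * SMS_PER_TPC - disabled_sms


full_ga102 = sm_count([6] * 7)                  # full die: 7 GPCs x 6 TPCs
rtx_3090 = sm_count([6] * 7, disabled_sms=2)    # two SMs fused off
rtx_3080 = sm_count([6] * 4 + [5] * 2)          # 6 GPCs, two of them with 5 TPCs

for name, sms in (("Full GA102", full_ga102), ("RTX 3090", rtx_3090), ("RTX 3080", rtx_3080)):
    print(f"{name}: {sms} SMs, {sms * FP32_CORES_PER_SM} CUDA cores")

print("RTX 3090 bus:", 12 * BITS_PER_MEMORY_CONTROLLER, "bits")
print("RTX 3080 bus:", 10 * BITS_PER_MEMORY_CONTROLLER, "bits")
```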
|Specification||RTX 3090||RTX 3080||RTX 3070|
|Base clock||1,400 MHz||1,440 MHz||1,500 MHz|
|Boost clock||1,700 MHz||1,710 MHz||1,730 MHz|
|VRAM||24 GB||10 GB||8 GB|
|Memory Speed||19.5 Gbps (GDDR6X)||19 Gbps (GDDR6X)||14 Gbps (GDDR6)|
|Bus Width||384 bits||320 bits||256 bits|
|Availability||24th September||17th September||October|
Furthermore, this is also the first time that the RTX x90/Titan and the RTX x80 share the same die. Usually, it’s the x80 Ti and the Titan sharing the top-end die while the x80 and x70 are based on the same GPU. This means that we’ll likely see an even larger Ampere die that will power the RTX 3080 Ti and a possible Titan.
Another major change with Ampere is that the ROPs are now hard-linked to the GPCs rather than to the memory controller/L2 cache partitions. Earlier, if you narrowed the bus of a GPU by disabling a memory controller, the associated ROPs and L2 cache would also get axed. Now the ROP count follows the GPC count instead, and the RTX 3080 has one fewer GPC than the RTX 3090.
RTX IO: Offloading Memory Decompression to the GPU
There are three ways to transfer data from storage to the VRAM buffer. The traditional method is to send it over PCIe (for NVMe drives) to system memory as per I/O instructions from the CPU, from where it’s sent to the GPU memory, again via the PCIe bus. This is an arduous and lengthy process that wastes many CPU cycles as well as bandwidth. You can save the latter by compressing the data, but then the CPU will spend several cycles decompressing it, further increasing the overhead.
With RTX I/O, thanks to the low-level access given to developers by DirectStorage, the data can be directly sent from the SSD to the video memory, compressing and decompressing it using dedicated hardware in the GPU. Furthermore, DirectStorage also helps send batch I/O requests while earlier every I/O request big or small was individually sent, adding unnecessary load to the processor. RTX I/O or DirectStorage, however, needs to be implemented by the developers for every game engine.
As per NVIDIA’s own tests, RTX I/O cuts load times on an NVMe SSD by roughly 70%. The test used a 24-core AMD Threadripper processor and a Gen 4 NVMe drive: it took all the CPU cores 5 seconds to decompress and load the assets, while with RTX IO the same operation was performed on the GPU in just 1.5 seconds.
GDDR6X Memory and PAM4
NVIDIA is the first vendor to opt for GDDR6X memory in its RTX 30 series GPUs, at least the higher-end ones. It increases the per-pin bandwidth from 14 Gbps to 21 Gbps and the overall bandwidth to 1008 GB/s, even more than a 3072-bit wide HBM2 configuration.
|Memory||GDDR6X||GDDR6||GDDR5X||HBM2|
|B/W Per Pin||21 Gbps||14 Gbps||11.4 Gbps||1.7 Gbps|
|Chip capacity||1 GB (8 Gb)||1 GB (8 Gb)||1 GB (8 Gb)||4 GB (32 Gb)|
|B/W Per Chip/Stack||84 GB/s||56 GB/s||45.6 GB/s||217.6 GB/s|
|Total B/W||1008 GB/s||672 GB/s||548 GB/s||652.8 GB/s|
|DRAM Voltage||1.35 V||1.35 V||1.35 V||1.2 V|
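The bandwidth rows in the table can be reproduced with a short Python calculation. This sketch assumes a 32-bit interface per GDDR chip with 12 chips on a 384-bit bus, and a 1024-bit interface per HBM2 stack with three stacks (3072 bits); the function name and dictionary are illustrative only.

```python
# Per-device and total bandwidth from the per-pin data rate.
# Assumptions: 32-bit GDDR chips (12 of them), 1024-bit HBM2 stacks (3 of them).

def bandwidth_gb_s(gbps_per_pin, interface_bits):
    """Bandwidth in GB/s: bits per second across all pins, divided by 8."""
    return gbps_per_pin * interface_bits / 8


configs = {
    "21 Gbps (GDDR6X)": (21.0, 32, 12),
    "14 Gbps (GDDR6)": (14.0, 32, 12),
    "11.4 Gbps": (11.4, 32, 12),
    "1.7 Gbps (HBM2)": (1.7, 1024, 3),
}

for name, (rate, width, devices) in configs.items():
    per_device = bandwidth_gb_s(rate, width)
    print(f"{name}: {per_device:.1f} GB/s per chip or stack, "
          f"{per_device * devices:.1f} GB/s total")
```

Under these assumptions the 11.4 Gbps column works out to 547.2 GB/s, marginally below the 548 GB/s listed in the table; the other columns match.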
The secret sauce behind GDDR6X memory is PAM4 encoding. In simple words, it doubles the data transfer per clock compared to GDDR6 which uses NRZ or binary coding.
With NRZ, you have just two signal levels, 0 and 1. PAM4 doubles that to four: 00, 01, 10, and 11. Each symbol therefore carries two bits, so you can send four bits of data per clock cycle (two per edge). The drawback of PAM4 is its high cost, especially at the higher frequencies of GDDR6X. This is the reason why no one had tried to implement it in consumer memory before.
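As a rough illustration of the difference, the Python sketch below encodes the same bit stream both ways. The mapping of bit pairs to levels is a plain binary assignment chosen for this example; real links may use a different assignment (Gray coding, for instance), so treat the table and function names as assumptions.

```python
# NRZ sends one bit per symbol, PAM4 sends two bits per symbol.
# Assumption: a plain binary mapping of bit pairs to the four signal levels.
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 0): 2, (1, 1): 3}


def nrz_encode(bits):
    """One symbol (level 0 or 1) per bit."""
    return list(bits)


def pam4_encode(bits):
    """One symbol (level 0 to 3) per pair of bits."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


bits = [1, 0, 1, 1, 0, 0, 0, 1]
print("NRZ symbols: ", nrz_encode(bits))
print("PAM4 symbols:", pam4_encode(bits))
```

The eight bits need eight NRZ symbols but only four PAM4 symbols, so at the same symbol rate PAM4 moves twice the data.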
Samsung 8nm Custom Node and Power Efficiency
Finally, let’s have a look at the custom 8nm node from Samsung that is the base for the consumer Ampere lineup. For starters, let’s make it clear that it’s nowhere close to the 7nm node on which the next-gen consoles and AMD’s Navi 2x GPUs will be fabbed. However, the fact that NVIDIA was able to achieve a 50% performance-per-watt gain is a testament to the efficiency of their GPU architecture.
|Specification||NVIDIA A100 (GA100)||NVIDIA RTX 3080/3090 (GA102)||NVIDIA RTX 3070 (GA104)||NVIDIA RTX 2080 Ti (TU102)|
|Node||7 nm (TSMC)||8 nm (Samsung)||8 nm (Samsung)||12 nm (TSMC)|
|Transistor Count||54 billion||28 billion||17.4 billion||18.6 billion|
|Size||826 mm²||628.4 mm²||392.5 mm²||754 mm²|
|Transistor Density||65.37 MT / mm²||44.56 MT / mm²||44.33 MT / mm²||24.67 MT / mm²|
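The density row is simply the transistor count divided by the die area. A quick Python check, using only the figures from the table (the labels and variable names are just for this example):

```python
# Transistor density in millions of transistors per square millimetre (MT/mm^2),
# computed from the transistor count and die size rows of the table.
chips = [
    ("7 nm (TSMC), 54 billion", 54e9, 826.0),
    ("8 nm (Samsung), 28 billion", 28e9, 628.4),
    ("8 nm (Samsung), 17.4 billion", 17.4e9, 392.5),
    ("12 nm (TSMC), 18.6 billion", 18.6e9, 754.0),
]

densities = {}
for label, transistors, area_mm2 in chips:
    densities[label] = transistors / area_mm2 / 1e6
    print(f"{label}: {densities[label]:.2f} MT/mm^2")

ratio = densities["7 nm (TSMC), 54 billion"] / densities["8 nm (Samsung), 28 billion"]
print(f"7 nm vs 8 nm density ratio: {ratio:.2f}x")
```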
As you can see, the transistor density is nearly 50% more with TSMC’s N7 process and the transistor count is nearly twice as much. Now, let’s have a look at the revised power efficiency chart for Ampere:
At the same power draw of 240W, the NVIDIA Ampere architecture is 50% more power efficient or 50% faster per watt compared to Turing. Not quite near the 90% figure that NVIDIA claims but impressive regardless. I reckon if Jensen went ahead with TSMC’s 7nm node, it would have been quite possible to attain or perhaps even cross the 1.9x mark.
It would be safe to say that the cat is finally out of the bag. We know most of the technical details behind the architecture powering the RTX 30 series graphics cards. NVIDIA has even provided a set of (cherry-picked) benchmarks for us to sift through, but as you can expect, we’d like to test them ourselves.
Recently, our relationship with NVIDIA’s PR team has been less than amicable: it turns out they don’t like it if you ask too many questions, especially about AMD technologies that seem better than their GeForce rivals. As a result, if our review gets delayed, you know why.
PS: I’m aware of the RTXGI (global illumination) technology that NVIDIA introduced with today’s presentation. It’s supposed to be faster and much more accurate than SVOGI. I’ll be covering it tomorrow.