[Update: More info regarding the SM design of Ampere has been shared. We’ve added the relevant info below.]
NVIDIA today unveiled its consumer Ampere graphics cards in the form of the GeForce RTX 3070, 3080, and 3090 with rather palatable price tags. NVIDIA claims a performance and efficiency gain of 2x over the preceding Turing lineup, but of course, these are the best-case figures, likely with ray-tracing and DLSS turned on. The actual raster performance will be a notch lower. Regardless, there are major technological improvements to both the standard graphics pipeline and the ray-tracing components:
New streaming multiprocessors: The Ampere GPUs feature newly designed Streaming Multiprocessors (NVIDIA’s equivalent of AMD’s Compute Units), promising up to twice the performance of Turing. NVIDIA is using a new metric here, Shader-TFLOPS, with Ampere rated at up to 30 Shader-TFLOPS. That figure applies to the GPU as a whole, not to each SM.
Traditionally, NVIDIA tweaks the SM every generation, changing the core count, SFUs and/or the arrangement of the shaders. With Turing we saw a major change to the SM, with the inclusion of separate Integer and Floating-Point cores for concurrent execution as well as the addition of RT and Tensor cores. I don’t expect the same kind of overhaul this generation, although we should see a sizeable IPC gain per CUDA core (and SM), potentially due to a revamped warp scheduler and dispatcher.
Second-gen RT Cores: New dedicated RT Cores deliver 2x the throughput of the previous generation, plus concurrent ray tracing and shading and compute, with 58 RT-TFLOPS of processing power.
Third-gen Tensor Cores: New dedicated Tensor Cores, with up to 2x the throughput of the previous generation, making it faster and more efficient to run AI-powered technologies, like NVIDIA DLSS, and 238 Tensor-TFLOPS of processing power.
As expected, both the RT and Tensor cores are getting a major uplift, delivering twice the performance of their Turing brethren. Hopefully, ray-tracing will become more mainstream with these improvements.
NVIDIA RTX IO: Enables rapid GPU-based loading and game asset decompression, accelerating input/output performance by up to 100x compared with hard drives and traditional storage APIs. In conjunction with Microsoft’s new DirectStorage for Windows API, RTX IO offloads dozens of CPU cores’ worth of work to the RTX GPU, improving frame rates and enabling near-instantaneous game loading.
This is similar to how the Velocity Architecture on the Xbox Series X improves loading times and I/O throughput. Basically, DirectStorage, and thereby RTX IO, allows assets and textures to be loaded in the background without breaking the immersion with a loading screen.
Previous gen games had an asset streaming budget of ~50MB/s which even at smaller 64k block sizes (ie. one texture tile) amounts to only hundreds of IO requests per second. With multi-gigabyte a second capable NVMe drives, to take advantage of the full bandwidth, this quickly explodes to tens of thousands of IO requests a second. Taking the Series X’s 2.4GB/s capable drive and the same 64k block sizes as an example, that amounts to >35,000 IO requests per second to saturate it.
Existing APIs require the application to manage and handle each of these requests one at a time first by submitting the request, waiting for it to complete, and then handling its completion. The overhead of each request is not very large and wasn’t a choke point for older games running on slower hard drives, but multiplied tens of thousands of times per second, IO overhead can quickly become too expensive preventing games from being able to take advantage of the increased NVMe drive bandwidths.
The DirectStorage API is architected in a way that takes all this into account and maximizes performance throughout the entire pipeline from NVMe drive all the way to the GPU.
It does this in several ways: by reducing per-request NVMe overhead, enabling batched many-at-a-time parallel IO requests which can be efficiently fed to the GPU, and giving games finer grain control over when they get notified of IO request completion instead of having to react to every tiny IO completion. (Source: Microsoft)
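The request-rate figures in the quote are easy to sanity-check. The snippet below is a minimal sketch of that arithmetic, assuming "64k" means 64 KiB blocks and that the quoted bandwidths are decimal megabytes and gigabytes per second; the function name is just for illustration.

```python
KIB = 1024


def io_requests_per_second(bandwidth_bytes_per_s: float,
                           block_bytes: int = 64 * KIB) -> float:
    """Fixed-size requests per second needed to sustain a given bandwidth."""
    return bandwidth_bytes_per_s / block_bytes


# ~50 MB/s streaming budget of previous-gen games (figure from the quote).
old_budget = io_requests_per_second(50e6)

# 2.4 GB/s NVMe drive in the Xbox Series X (figure from the quote).
series_x = io_requests_per_second(2.4e9)

print(f"Previous-gen budget: {old_budget:,.0f} requests/s")
print(f"Series X NVMe drive: {series_x:,.0f} requests/s")
```

The first figure lands in the hundreds and the second above 35,000, matching the quote. A different reading of "64k" (64,000 bytes) shifts the numbers slightly but not the order of magnitude.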
This is done by offloading I/O work from the CPU to the GPU (in the case of RTX IO), reducing the CPU overhead by as much as two-thirds. This becomes more and more important as 4K textures become more widespread and SSD speeds keep climbing. DirectStorage gives developers low-level access to the storage controller of SSDs, allowing for fine-grained optimizations.
GDDR6X Memory @ 19Gbps: NVIDIA has worked with Micron to create an improved variant of GDDR6 graphics memory for the RTX 30 Series, GDDR6X. It provides memory bandwidth of up to 760 GB/s for the RTX 3080 and nearly 1TB/s for the 3090, without the use of expensive HBM memory.
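For reference, peak memory bandwidth is simply the bus width multiplied by the per-pin data rate. The sketch below reproduces the figures above under assumptions that are not stated in this article: the bus widths (320-bit for the RTX 3080, 384-bit for the RTX 3090) and the 3090's 19.5Gbps data rate are taken from NVIDIA's published specifications.

```python
def memory_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s: bus width (bits) x per-pin rate (Gbps) / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


# Assumed from NVIDIA's spec sheets, not from the article text:
# RTX 3080 = 320-bit bus at 19 Gbps, RTX 3090 = 384-bit bus at 19.5 Gbps.
rtx_3080 = memory_bandwidth_gbs(320, 19.0)
rtx_3090 = memory_bandwidth_gbs(384, 19.5)

print(f"RTX 3080: {rtx_3080:.0f} GB/s")
print(f"RTX 3090: {rtx_3090:.0f} GB/s")
```

With those inputs the RTX 3080 works out to 760 GB/s and the RTX 3090 to 936 GB/s, which is where the "nearly 1TB/s" comes from.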
Samsung 8nm process: It appears that Twitter was right all along. NVIDIA’s consumer Ampere lineup will stick with Samsung’s 8nm LPP process. NVIDIA calls this a “custom” 8nm process but I don’t think the density and efficiency will be any different.
Meet the RTX 3070, 3080 and 3090
GeForce RTX 3070: Starting at $499, the RTX 3070 is expected to be faster than the RTX 2080 Ti at less than half the price, and on average is 60 percent faster than the original RTX 2070. It is equipped with 8GB of GDDR6 memory, hitting the sweet spot of performance for games running at 4K and 1440p resolutions.
GeForce RTX 3080: Starting at $699, the RTX 3080 will be the next go-to graphics card for 4K enthusiasts and gamers. Featuring 10GB of GDDR6X memory running at 19Gbps, the RTX 3080 is supposed to be 80-90% faster than the RTX 2080 and around 30-40% faster than the RTX 2080 Ti.
GeForce RTX 3090: At the very top, we have the RTX 3090, priced at $1,499. Owing to that price tag (similar to the Titan), it’ll be more of a titular flagship, with most gamers flocking to the 3080. With as much as 24GB of GDDR6X memory, NVIDIA is aiming this card at workstations rather than gamers. It’s expected to be as much as 50% faster than the Turing-based Titan RTX.
The above shader/core counts include both the Floating-point and Integer cores. To get the actual figure, halve the listed count: 5,248 for the 3090, 4,352 for the 3080 and 2,944 CUDA cores for the 3070.
The Ampere SM is split into four partitions, each with two datapaths (pipelines). One datapath is a set of 16 FP32 cores; the other is a set of 16 FP32 and 16 INT32 cores. As a result of this new partitioning, each Ampere SM partition can execute either 32 FP32 instructions per clock, or 16 FP32 and 16 INT32 instructions per clock. You’re essentially giving up integer throughput whenever the second datapath runs FP32, in exchange for up to twice the floating-point capability. Fortunately, as the majority of graphics workloads are FP32, this should work to NVIDIA’s advantage.
Overall, all four SM partitions combined can execute 128 FP32 operations per clock, or 64 FP32 and 64 INT32 operations per clock.
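To make the per-clock numbers concrete, here is a small sketch that models one SM partition as two datapaths and scales it to a full SM. It is only a bookkeeping model of the lane counts described here, not a description of how the hardware schedules work; the class and function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

PARTITIONS_PER_SM = 4


@dataclass(frozen=True)
class Partition:
    fp32_only_lanes: int = 16  # datapath 1: FP32 only
    shared_lanes: int = 16     # datapath 2: FP32 or INT32 in a given clock

    def ops_per_clock(self, shared_runs_int: bool) -> Tuple[int, int]:
        """Return (FP32 ops, INT32 ops) issued by this partition in one clock."""
        if shared_runs_int:
            return self.fp32_only_lanes, self.shared_lanes
        return self.fp32_only_lanes + self.shared_lanes, 0


def sm_ops_per_clock(shared_runs_int: bool) -> Tuple[int, int]:
    """Scale one partition's throughput to all four partitions of an SM."""
    fp32, int32 = Partition().ops_per_clock(shared_runs_int)
    return fp32 * PARTITIONS_PER_SM, int32 * PARTITIONS_PER_SM


print("All-FP32 clock:  ", sm_ops_per_clock(shared_runs_int=False))
print("Mixed FP32/INT32:", sm_ops_per_clock(shared_runs_int=True))
```

The two calls give (128, 0) and (64, 64), the same per-SM figures quoted above.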
One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
Doubling math throughput required doubling the data paths supporting it, which is why the Ampere SM also doubled the shared memory and L1 cache performance for the SM. (128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.
The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs. More details on the NVIDIA Ampere architecture can be found in NVIDIA’s Ampere Architecture White Paper, which will be published in the coming days. (Source: Tony Tamasi, NVIDIA)
To allow the use of two data paths and 2x FP32 performance, L1 cache bandwidth (and the associated shared memory) had to be doubled as well: 128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing. Total L1 bandwidth for the RTX 3080 is 219 GB/sec versus 116 GB/sec for RTX 2080 Super.
The raster back-end has also been beefed up. Each GPC now has a raster engine with two ROP partitions, each packing eight ROPs, for sixteen ROPs per GPC. The ROP count is therefore tied to the number of GPCs rather than to the 32-bit memory controllers, as it was on Turing. This results in a total ROP count of 96 for the RTX 3080 (six GPCs) and 112 for the 3090 (seven GPCs).
Misleading perf/W improvement: NVIDIA claims a performance per watt increase of 1.9x with Ampere, but if you look at the chart closely, this is somewhat misleading. For starters, the power efficiency is measured at a fixed level of performance rather than a fixed power draw which is the norm.
The GA102 die that powers the RTX 3080 and 3090 packs far more shader cores than the TU102 die behind the top Turing cards. For the same reason, you see a massive performance gain despite the clocks being slightly lower. Essentially, NVIDIA has downclocked the new GPUs while increasing the core count significantly and then compared them to Turing, which features a much lower core count but higher clock speeds. This comparison would be justified if both parts were running at similar core clocks or the performance was measured at a fixed power draw.
If you make a graph with a fixed power draw, then Ampere is roughly 50% more efficient (in terms of the performance per watt metric), nowhere near the 1.9x figure that NVIDIA provides. This can be calculated by locking the power at 240W. At this wattage, Turing achieves 60 FPS while Ampere nets around 90 FPS, thereby a gain of 50% in terms of power efficiency.
This translates into the following:
- The RTX 3080 is 2x faster while increasing the power draw by nearly 50%, thereby offering a performance per watt gain of just 34%.
- The RTX 3090 seems even less impressive, with a performance gain of 50% by increasing the power draw by 25%, resulting in an efficiency gain of just 20%.
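The efficiency figures above all come from the same ratio: relative performance divided by relative power. A quick sketch of that calculation follows, using only the numbers from this section; reading "nearly 50%" as a power ratio of exactly 1.5 is my own simplification.

```python
def perf_per_watt_gain(perf_ratio: float, power_ratio: float) -> float:
    """Fractional perf/W improvement for a given performance and power ratio."""
    return perf_ratio / power_ratio - 1.0


# Fixed 240W power draw: Turing at 60 FPS versus Ampere at around 90 FPS.
iso_power = perf_per_watt_gain(90 / 60, 1.0)

# RTX 3080: 2x the performance at roughly 1.5x the power (assumed exactly 1.5 here).
rtx_3080 = perf_per_watt_gain(2.0, 1.5)

# RTX 3090: 1.5x the performance at 1.25x the power.
rtx_3090 = perf_per_watt_gain(1.5, 1.25)

for name, gain in [("Fixed 240W", iso_power),
                   ("RTX 3080", rtx_3080),
                   ("RTX 3090", rtx_3090)]:
    print(f"{name}: {gain:.0%} perf/W gain")
```

With a power ratio of exactly 1.5 the RTX 3080 comes out at about 33%; the 34% quoted above corresponds to a power increase slightly under 50%.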
Overall, this explains why the Ampere cards have lower clocks than Turing despite featuring a smaller node. It’s because the shader count is much higher which in turn has increased the power draw by quite a bit.
In addition to the Founders Edition cards, custom GeForce RTX 3090, 3080, and 3070 cards will be available from AIB partners including ASUS, Colorful, EVGA, Gainward, Galaxy, Gigabyte, Innovision 3D, MSI, Palit, PNY, and Zotac. The GeForce RTX 3080 will be available starting Sept. 17. The GeForce RTX 3090 will be available starting Sept. 24. The GeForce RTX 3070 will be available in October. For a limited time, gamers who purchase a new GeForce RTX 30 Series GPU or system will receive a PC digital download of Watch Dogs: Legion and a one-year subscription to the NVIDIA GeForce NOW cloud gaming service.