During a Reddit Q&A, NVIDIA answered some of the most pressing questions gamers and the press had, covering the SM structure, the memory buffer, RTX IO and more. The very first question concerned the RTX 3080's memory buffer, which is largely unchanged from its predecessor's.
Replying to this question, NVIDIA’s Justin Walker explained that, as per the company’s analysis, 10GB will be sufficient to run present and upcoming games at 4K ultra without running into memory bottlenecks. Furthermore, he disclosed that recent AAA games such as Shadow of the Tomb Raider, Metro Exodus, AC: Odyssey and Borderlands 3 run well with just 4-6GB of memory usage on an RTX 3080 (at 4K). In the end, Walker conceded that it’s always better to have more memory, but increasing it past 10GB would have made the 3080 unnecessarily pricier.
[Justin Walker] We’re constantly analyzing memory requirements of the latest games and regularly review with game developers to understand their memory needs for current and upcoming games. The goal of 3080 is to give you great performance at up to 4k resolution with all the settings maxed out at the best possible price.
In order to do this, you need a very powerful GPU with high speed memory and enough memory to meet the needs of the games. A few examples – if you look at Shadow of the Tomb Raider, Assassin’s Creed Odyssey, Metro Exodus, Wolfenstein Youngblood, Gears of War 5, Borderlands 3 and Red Dead Redemption 2 running on a 3080 at 4k with Max settings (including any applicable high res texture packs) and RTX On, when the game supports it, you get in the range of 60-100fps and use anywhere from 4GB to 6GB of memory.
Extra memory is always nice to have but it would increase the price of the graphics card, so we need to find the right balance.
Ampere Streaming Multiprocessor (SM)
Each Ampere SM is divided into four partitions, and each partition has two datapaths (pipelines): one with 16 FP32 cores, and one with 16 FP32 cores and 16 INT32 cores. As a result of this new partitioning, each Ampere SM partition can execute either 32 FP32 instructions per clock or 16 FP32 and 16 INT32 instructions per clock. In effect, the second datapath gives up concurrent integer execution in exchange for twice the floating-point throughput whenever it runs FP32. Fortunately, as the majority of graphics workloads are FP32, this should work to NVIDIA’s advantage.
Overall, all four SM partitions combined can execute either 128 FP32 operations per clock, or 64 FP32 and 64 INT32 operations per clock.
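The per-clock figures above follow from simple multiplication, which can be sketched in Python. This is only an illustrative model of the arithmetic, not a description of how the hardware schedules work; the names `IssueMode`, `partition_throughput` and `sm_throughput` are made up for this example.

```python
from dataclasses import dataclass

# Figures taken from the description above.
PARTITIONS_PER_SM = 4      # four partitions per Ampere SM
LANES_PER_DATAPATH = 16    # each datapath handles 16 operations per clock


@dataclass(frozen=True)
class IssueMode:
    """Operations per clock, split by type (hypothetical helper type)."""
    fp32_per_clock: int
    int32_per_clock: int


def partition_throughput(shared_path_runs_int32: bool) -> IssueMode:
    """Per-partition throughput for the two cases described in the text.

    Datapath 1 is FP32-only. Datapath 2 runs either FP32 or INT32.
    """
    fp32 = LANES_PER_DATAPATH  # datapath 1 always contributes FP32
    if shared_path_runs_int32:
        return IssueMode(fp32, LANES_PER_DATAPATH)
    return IssueMode(fp32 + LANES_PER_DATAPATH, 0)


def sm_throughput(shared_path_runs_int32: bool) -> IssueMode:
    """Scale the per-partition figures up to a whole SM."""
    part = partition_throughput(shared_path_runs_int32)
    return IssueMode(
        part.fp32_per_clock * PARTITIONS_PER_SM,
        part.int32_per_clock * PARTITIONS_PER_SM,
    )


all_fp32 = sm_throughput(shared_path_runs_int32=False)
mixed = sm_throughput(shared_path_runs_int32=True)
print(all_fp32)  # 128 FP32, 0 INT32 per clock
print(mixed)     # 64 FP32, 64 INT32 per clock
```

The model treats the two cases as all-or-nothing per clock, which is how the figures are presented here; real workloads will land somewhere between the two depending on their instruction mix.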
One of the key design goals for the Ampere 30-series SM was to achieve twice the throughput for FP32 operations compared to the Turing SM. To accomplish this goal, the Ampere SM includes new datapath designs for FP32 and INT32 operations. One datapath in each partition consists of 16 FP32 CUDA Cores capable of executing 16 FP32 operations per clock. Another datapath consists of both 16 FP32 CUDA Cores and 16 INT32 Cores. As a result of this new design, each Ampere SM partition is capable of executing either 32 FP32 operations per clock or 16 FP32 and 16 INT32 operations per clock. All four SM partitions combined can execute 128 FP32 operations per clock, which is double the FP32 rate of the Turing SM, or 64 FP32 and 64 INT32 operations per clock.
Doubling math throughput required doubling the data paths supporting it, which is why the Ampere SM also doubled the shared memory and L1 cache performance for the SM. (128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing). Total L1 bandwidth for GeForce RTX 3080 is 219 GB/sec versus 116 GB/sec for GeForce RTX 2080 Super.
The GPC is the dominant high-level hardware block with all of the key graphics processing units residing inside the GPC. Each GPC includes a dedicated Raster Engine, and now also includes two ROP partitions (each partition containing eight ROP units), which is a new feature for NVIDIA Ampere Architecture GA10x GPUs. More details on the NVIDIA Ampere architecture can be found in NVIDIA’s Ampere Architecture White Paper, which will be published in the coming days.
Tony Tamasi, NVIDIA
To allow the use of two data paths and 2x FP32 performance, L1 cache bandwidth (and the associated shared memory) had to be doubled as well: 128 bytes/clock per Ampere SM versus 64 bytes/clock in Turing. Total L1 bandwidth for the RTX 3080 is 219 GB/sec versus 116 GB/sec for RTX 2080 Super.
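As a rough sanity check, the quoted bandwidth totals line up with bytes per clock multiplied by clock speed. The sketch below assumes the publicly listed boost clocks of 1,710 MHz for the RTX 3080 and 1,815 MHz for the RTX 2080 Super, neither of which is stated in the Q&A. Under that assumption the quoted figures match the bandwidth of a single SM at boost clock; this is a reading of the numbers, not something NVIDIA confirmed in the answer.

```python
def l1_bandwidth_gb_per_s(bytes_per_clock: int, clock_ghz: float) -> float:
    """Bandwidth in GB/s = bytes moved per clock * clocks per nanosecond (GHz)."""
    return bytes_per_clock * clock_ghz


# Per-SM L1 widths quoted in the Q&A.
AMPERE_BYTES_PER_CLOCK = 128
TURING_BYTES_PER_CLOCK = 64

# ASSUMPTION: published boost clocks, not given in the Q&A.
RTX_3080_BOOST_GHZ = 1.710
RTX_2080_SUPER_BOOST_GHZ = 1.815

rtx_3080 = l1_bandwidth_gb_per_s(AMPERE_BYTES_PER_CLOCK, RTX_3080_BOOST_GHZ)
rtx_2080_super = l1_bandwidth_gb_per_s(TURING_BYTES_PER_CLOCK, RTX_2080_SUPER_BOOST_GHZ)

print(f"RTX 3080:       {rtx_3080:.1f} GB/s")        # about 219 GB/s
print(f"RTX 2080 Super: {rtx_2080_super:.1f} GB/s")  # about 116 GB/s
```

Note that the ratio between the two cards comes out slightly below 2x because the assumed RTX 3080 boost clock is lower than the RTX 2080 Super's.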
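The raster back-end has also been buffed up. Each GPC now has a raster engine with two ROP partitions, each packing eight ROPs. This means the ROPs are now tied to the GPCs, at sixteen per GPC, rather than to the memory controllers as in Turing, which paired eight ROPs with every 32-bit memory controller. With six GPCs enabled on the RTX 3080 and seven on the RTX 3090, this results in a total ROP count of 96 for the RTX 3080 and 112 for the 3090.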