
NVIDIA RTX 4090/4090 Ti with 18,432 FP32 Cores and 24GB GDDR6X Memory to Offer 100 TFLOPs of FP32 Throughput at 2.8GHz [Report]

The specifications of NVIDIA’s AD102 “Lovelace” GPU die have finally been confirmed. The fully enabled core will pack an incredible 18,432 FP32 cores, a sizable increase over AMD Navi 31’s 12,288 shaders. As reported the other day, the RTX 4090 will come with a couple of GPCs partially fused off, bringing the effective core count down to 16,128. If NVIDIA ever launches a 4090 Ti, that behemoth will leverage the full-fat AD102 core and its 18,432 shaders.

According to Kopite7kimi, the AD102 die will be able to touch the 100 TFLOPs single-precision mark with its core running at 2.8GHz. What he still isn’t sure about is the SM structure. NVIDIA has a habit of mildly restructuring its SM (the equivalent of AMD’s Compute Unit) every generation. This time around, it might be thoroughly overhauled, much like it was with Maxwell roughly eight years back, or it might carry over largely unchanged.
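For a quick sanity check of that figure, here’s the standard peak-throughput math (FP32 cores × 2 FLOPs per clock, since each core can retire one fused multiply-add per cycle, × clock speed), using the rumoured numbers rather than anything confirmed:

```cpp
// Quick sanity check of the 100 TFLOPs claim.
// Peak FP32 = cores x 2 FLOPs per clock (one fused multiply-add) x clock speed.
#include <cstdio>

int main()
{
    const double fp32_cores      = 18432;  // full-fat AD102, per the report
    const double flops_per_clock = 2.0;    // one FMA = 2 floating-point ops
    const double clock_ghz       = 2.8;    // rumoured clock

    // cores x FLOPs/clock x GHz yields GFLOPs; divide by 1,000 for TFLOPs.
    const double tflops = fp32_cores * flops_per_clock * clock_ghz / 1000.0;
    printf("Peak FP32 throughput: %.1f TFLOPs\n", tflops);  // ~103.2 TFLOPs
    return 0;
}
```

At 2.8GHz the formula lands at roughly 103 TFLOPs, so the 100 TFLOPs claim checks out; the ~90 TFLOPs figure in the table further down corresponds to the same math at a clock closer to 2.45GHz.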

I’ll recap what I had shared a while back about the Maxwell SM and the possible SM design of Lovelace:


With Maxwell, the warp schedulers and the resulting threads per SM per clock were quadrupled, netting roughly 35% more performance per core (NVIDIA quoted 135% of Kepler’s per-core throughput). It looks like NVIDIA wants to pull another Maxwell, a generation known for exceptional performance and power efficiency that absolutely crushed rival AMD’s Radeon offerings.

[Image: Ampere SM]

This would mean that the overall core count per SM remains unchanged (128), but the resources available to each partition would increase drastically. Most notably, with eight partitions each issuing a 32-thread warp per clock, the number of concurrently issued threads per SM would double from 128 to 256. It’s hard to say how much of a performance increase this would translate to, but we’d certainly see a fat gain. Unfortunately, this layout takes up a lot of expensive die space, something NVIDIA is already paying a lot of dough for (TSMC N4). So, it’s hard to say whether Jensen’s team actually managed to pull this off or shelved it for future designs.

[Image: Lovelace SM with 8 partitions]
[Image: Fermi vs Kepler vs Maxwell vs Turing SMs]

There’s also a chance that Team Green goes with a coupled SM design, something already introduced with Hopper. In case you missed the Hopper whitepaper, here’s a small primer on Thread Block Clusters and Distributed Shared Memory (DSM). To make scheduling on GPUs with over 100 SMs more efficient, Hopper (and potentially Lovelace) will group every two thread blocks in a GPC into a cluster. The primary aim of Thread Block Clusters is to improve multithreading and SM utilization. These clusters run concurrently across the SMs of a GPC.


Thanks to a dedicated SM-to-SM network, the two thread blocks in a cluster can efficiently share data (and each other’s shared memory) without a round trip through global memory. This is going to be one of the key features promoting scalability on Hopper and Lovelace, a key requirement when you’re increasing the core/ALU count by over 50%.
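To make that more concrete, here’s a minimal sketch using the CUDA 12 cooperative-groups cluster API, which is how Hopper (sm_90) exposes Thread Block Clusters and distributed shared memory today; whether Ada/Lovelace ends up supporting the same feature is exactly the speculation above, so treat it purely as an illustration (the kernel and file names are made up for the example). Each block in a two-block cluster publishes a value in its own shared memory, then reads its partner’s copy over the SM-to-SM fabric:

```cpp
// Minimal sketch of a Thread Block Cluster with distributed shared memory,
// using the CUDA 12 cooperative-groups API as exposed on Hopper (sm_90).
// Whether Ada/Lovelace supports this is the speculation discussed above.
// Build (hypothetical file name): nvcc -arch=sm_90 cluster_sketch.cu
#include <cooperative_groups.h>
#include <cstdio>

namespace cg = cooperative_groups;

// Two thread blocks per cluster, fixed at compile time.
__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(int *out)
{
    extern __shared__ int smem[];                      // 1 int of dynamic shared memory
    cg::cluster_group cluster = cg::this_cluster();
    const unsigned int rank = cluster.block_rank();    // 0 or 1 within this cluster

    if (threadIdx.x == 0)
        smem[0] = (int)blockIdx.x;                     // publish this block's grid-wide ID

    // Make every block's shared-memory write visible across the cluster.
    cluster.sync();

    // Map the partner block's shared memory over the SM-to-SM network and read it.
    const unsigned int partner = (rank + 1) % cluster.num_blocks();
    const int *partner_smem = cluster.map_shared_rank(smem, partner);

    if (threadIdx.x == 0)
        out[blockIdx.x] = partner_smem[0];             // each block records its partner's ID

    // Keep shared memory alive until the partner has finished reading it.
    cluster.sync();
}

int main()
{
    const int blocks = 4;                              // two clusters of two blocks each
    int *out = nullptr;
    cudaMallocManaged(&out, blocks * sizeof(int));

    exchange_kernel<<<blocks, 32, sizeof(int)>>>(out);
    cudaDeviceSynchronize();

    for (int i = 0; i < blocks; ++i)
        printf("block %d read partner block %d\n", i, out[i]);

    cudaFree(out);
    return 0;
}
```

The two cluster.sync() calls matter: the first makes each block’s shared-memory write visible to its partner, and the second keeps a block’s shared memory alive until the partner has finished reading it.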

| GPU | TU102 | GA102 | AD102 | AD103 | AD104 |
|---|---|---|---|---|---|
| Arch | Turing | Ampere | Ada Lovelace | Ada Lovelace | Ada Lovelace |
| Process | TSMC 12nm | Samsung 8nm LPP | TSMC 5nm | TSMC 5nm | TSMC 5nm |
| GPCs | 6 | 7 | 12 | 7 | 5 |
| TPCs | 36 | 42 | 72 | 42 | 30 |
| SMs | 72 | 84 | 144 | 84 | 60 |
| Shaders | 4,608 | 10,752 | 18,432 | 10,752 | 7,680 |
| FP32 TFLOPs | 16.1 | 37.6 | ~90? | ~50 | ~35 |
| Memory | 11GB GDDR6 | 24GB GDDR6X | 24GB GDDR6X | 16GB GDDR6 | 16GB GDDR6 |
| L2 Cache | 6MB | 6MB | 96MB | 64MB | 48MB |
| Bus Width | 384-bit | 384-bit | 384-bit | 256-bit | 192-bit |
| TGP | 250W | 350W | 600W? | 350W? | 250W? |
| Launch | Sep 2018 | Sep 2020 | Aug-Sep 2022 | Q4 2022 | Q4 2022 |

These are the two potential ways NVIDIA can (nearly) double the core counts without crippling scaling or leaving some of the shaders underutilized. Of course, there’s always a chance that Jensen’s team comes up with something entirely new and unexpected.

Areej

Computer hardware enthusiast, PC gamer, and almost an engineer. Former co-founder of Techquila (2017-2019), a fairly successful tech outlet. Working on Hardware Times, an outlet dedicated to computer hardware and its applications, since 2019.
