The specifications of NVIDIA’s Lovelace flagship have supposedly been finalized. The GeForce RTX 4090, built on the AD102 die, will feature a total of 16,128 FP32 cores across 126 SMs, 63 TPCs, and 11 GPCs. This massive die will be paired with 24GB of 21Gbps GDDR6X memory across a 384-bit bus, the same as the RTX 3090 Ti. Lovelace will likely borrow some of Hopper’s features, especially Thread Block Memory Sharing, which should drastically boost SM utilization, while the massive 96MB of L2 cache should do the same for effective bandwidth.
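The headline numbers can be sanity-checked with some quick arithmetic. The sketch below assumes 128 FP32 cores per SM (which is what 16,128 / 126 works out to) and uses the usual peak-bandwidth formula of per-pin data rate times bus width divided by 8. The function names are purely illustrative.

```python
def fp32_cores(sm_count: int, cores_per_sm: int = 128) -> int:
    """Total FP32 cores, assuming every SM carries the same number of cores."""
    return sm_count * cores_per_sm


def memory_bandwidth_gbs(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s: per-pin rate (Gbps) x bus width (bits) / 8."""
    return data_rate_gbps * bus_width_bits / 8


def sms_per_tpc(sm_count: int, tpc_count: int) -> float:
    """Average number of SMs grouped under each TPC."""
    return sm_count / tpc_count


print(fp32_cores(126))               # 16128
print(memory_bandwidth_gbs(21, 384)) # 1008.0
print(sms_per_tpc(126, 63))          # 2.0
```

In other words, the rumored memory configuration comes out at roughly 1TB/s of peak bandwidth, with two SMs per TPC.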
In case you missed out on the Hopper Whitepaper, here’s a small primer on Thread Block Clusters and Distributed Shared Memory (DSM). To make scheduling on GPUs with over 100 SMs more efficient, Hopper and Lovelace will group every two thread blocks in a GPC into a cluster. The primary aim of Thread Block Clusters is to improve multithreading and SM utilization. These Clusters run concurrently across SMs in a GPC.
Thanks to an SM-to-SM network between the two thread blocks in a cluster, data can be shared efficiently between them. This is going to be one of the key features promoting scalability on Hopper and Lovelace, which is a key requirement when you’re increasing the core/ALU count by over 50%.
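To make the idea more concrete, here is a toy Python model of the scheme described above. It is not CUDA and does not reflect any real NVIDIA API; the `ThreadBlock` class, the cluster size of two, and the helper functions are all assumptions made for illustration. Each block owns a small "shared memory" list, and the rank-0 block of a cluster reads its peer's list directly, standing in for the SM-to-SM path that avoids a round trip through global memory.

```python
from dataclasses import dataclass

CLUSTER_SIZE = 2  # assumption: two thread blocks per cluster, as described above


@dataclass
class ThreadBlock:
    block_id: int
    shared: list  # stands in for this block's on-SM shared memory

    @property
    def cluster_id(self) -> int:
        # Consecutive blocks are grouped into the same cluster.
        return self.block_id // CLUSTER_SIZE

    @property
    def rank(self) -> int:
        # Position of this block inside its cluster (0 or 1 here).
        return self.block_id % CLUSTER_SIZE


def make_clusters(blocks):
    """Group blocks by cluster id, preserving rank order within each cluster."""
    clusters = {}
    for block in sorted(blocks, key=lambda b: b.block_id):
        clusters.setdefault(block.cluster_id, []).append(block)
    return clusters


def cluster_reduce(cluster):
    """Rank 0 sums its own data plus its peers' shared memory, read directly."""
    leader = cluster[0]
    total = sum(leader.shared)
    for peer in cluster[1:]:
        total += sum(peer.shared)  # models a direct SM-to-SM read
    return total


blocks = [ThreadBlock(i, [i] * 4) for i in range(4)]
clusters = make_clusters(blocks)
print({cid: cluster_reduce(c) for cid, c in clusters.items()})  # {0: 4, 1: 20}
```

On real hardware the two blocks of a cluster would run concurrently on different SMs; the sequential loop here only illustrates who reads whose data.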
Lastly, let’s not forget that the RTX 4090 won’t feature the full-fat AD102 die, yet it is still expected to offer twice the performance of its predecessor. The TGP is eventually going to be “just” 450W, a far cry from the previously rumored 600-900W abominations. The RTX 4090 Ti, which may launch later in the cycle with the fully enabled AD102 die, is more likely to come with a 600W TGP.
|GPU|GA102|AD102|RTX 4090|AD103|RTX 4080|RTX 4070 Ti (AD104)|RTX 4070|
|---|---|---|---|---|---|---|---|
|Arch|Ampere|Ada Lovelace|Ada Lovelace|Ada Lovelace|Ada Lovelace|Ada Lovelace|Ada Lovelace|
|Process|Samsung 8nm LPP|TSMC 5nm|TSMC 5nm|TSMC 5nm|TSMC 5nm|TSMC 5nm|TSMC 5nm|
|Throughput|37.6 TFLOPs|~100 TFLOPs?|83 TFLOPs|~50 TFLOPs|47 TFLOPs?|~35 TFLOPs|35 TFLOPs?|
|Memory|24GB GDDR6X|48GB GDDR6X|24GB GDDR6X|16GB GDDR6X|16GB GDDR6X|12GB GDDR6X|12GB GDDR6X|
|Launch|Sep 2020|Sept 22?|Sept 22?|Sept 22?|Sept 22?|Q1 2023?|Q1 2023?|
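
The throughput figures in the table can be tied back to the core counts with the usual peak-FP32 formula: cores x 2 FLOPs per clock (one fused multiply-add) x clock speed. The sketch below assumes that convention and works backwards from the RTX 4090's 83 TFLOPs and 16,128 cores to the boost clock those numbers would imply. The function names are hypothetical, and the resulting clock is an inference from the rumored figures, not a confirmed spec.

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_clock: int = 2) -> float:
    """Peak FP32 throughput in TFLOPs, assuming one FMA (2 FLOPs) per core per clock."""
    return cores * flops_per_clock * clock_ghz / 1000


def implied_clock_ghz(tflops: float, cores: int, flops_per_clock: int = 2) -> float:
    """Clock speed (GHz) needed to reach a given peak TFLOPs figure."""
    return tflops * 1000 / (cores * flops_per_clock)


clock = implied_clock_ghz(83, 16128)
print(round(clock, 2))  # 2.57
```

A clock in the region of 2.5-2.6GHz would therefore be needed for 16,128 cores to hit the quoted 83 TFLOPs.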