NVIDIA has reportedly overhauled the SM (streaming multiprocessor) design for its upcoming RTX 40 series “Lovelace” GPUs, doubling the number of dispatch units and warp schedulers. The leak comes directly from kopite7kimi, the premier source on NVIDIA’s future products, which makes it even more surprising. Over the past few generations, the SM has been tweaked in several ways, with changes to the datapath, the FP32/INT32 core ratio, and the cache/memory subsystem. However, since the Maxwell microarchitecture (GTX 900 series) of 2014, we haven’t seen an increase in the control logic.
With Maxwell, the warp schedulers and the resulting threads per SM/clock were quadrupled, resulting in a 135% performance gain per core. It looks like NVIDIA wants to pull another Maxwell, a generation known for exceptional performance and power efficiency that absolutely crushed rival AMD’s Radeon offerings.
This would mean that the overall core count per SM would remain unchanged (128), but the resources accessible to each cluster would increase drastically. Most notably, the number of threads issued per SM per clock would double from 128 to 256. It’s hard to say how much of a performance increase this will translate to, but we’ll likely see a fat gain. Unfortunately, this layout takes up a lot of expensive die space on a node NVIDIA is already paying a lot of dough for (TSMC N4). So, it’s hard to say whether Jensen’s team actually managed to pull this off or shelved it for future designs.
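For a rough sense of where the 128 and 256 figures come from, here is a back-of-the-envelope sketch. It assumes each warp scheduler issues one 32-thread warp per clock, and that the current SM has four schedulers while the rumored Lovelace SM has eight. The function name is made up for illustration, and none of this is confirmed by NVIDIA.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def threads_issued_per_clock(warp_schedulers: int, warp_size: int = WARP_SIZE) -> int:
    """Peak threads an SM can issue per clock.

    Assumption: every scheduler issues exactly one full warp each clock.
    """
    return warp_schedulers * warp_size


current = threads_issued_per_clock(4)  # assumed: four schedulers per SM today
rumored = threads_issued_per_clock(8)  # assumed: eight schedulers, per the leak

print(f"current SM: {current} threads/clock")
print(f"rumored SM: {rumored} threads/clock ({rumored // current}x)")
```

This is a peak issue rate only. Real-world gains depend on whether there are enough execution units and enough ready warps to keep the extra schedulers busy.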
It’s also worth noting that NVIDIA is reportedly reverting to its pre-Turing SM datapath with Lovelace, merging the FP32 and INT32 cores into the same cluster. This means that integer compute will take a backseat, but that shouldn’t be an issue as gaming workloads are predominantly FP32.
Update: More potential SM designs: