There has been a lot of speculation regarding AMD’s next-gen RDNA 3 GPUs, which are expected to be the first gaming GPUs on the market to use a chiplet design. A new AMD patent sheds additional light on that design, while also contradicting what some fairly well-known leakers have been speculating. Most importantly, the patent doesn’t mention a separate I/O die; all individual chiplets are compute dies with roughly the same design. This means that I/O operations will most likely be handled by the primary compute chiplet.
Furthermore, the active bridge chiplet, which connects the individual dies using integrated cache circuits, doesn’t sit on top of the dies. Instead, it lies underneath the chiplets, embedded in the substrate. The active bridge links the various GPU chiplets and offers an external unified memory interface, allowing the chiplets to communicate with one another as well as synchronize workloads. The entire L3 cache resides on this bridge chiplet. Lastly, the memory channels connecting the memory to the GPU exist on each chiplet but are controlled by the primary chiplet alone.
As mentioned at the beginning of the post, the three chiplets will be identical, each having the same design (for production reasons), but the memory controller would only be enabled on the master, or primary, die. Each chiplet will have its own set of memory channels, but they will most likely be controlled by the master die and attached to the L3 cache on the active bridge. This means that if a single GPU chiplet has a 64-bit bus, two chiplets will result in a GPU with a 128-bit bus and three in a 192-bit bus.
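The bus-width arithmetic above can be written out as a tiny sketch. The 64-bit per-chiplet figure is the hypothetical value used in the text, not a confirmed specification, and the function name is made up for illustration.

```python
def total_bus_width(num_chiplets: int, bits_per_chiplet: int = 64) -> int:
    """Aggregate memory bus width when every chiplet contributes its own channels.

    bits_per_chiplet defaults to the illustrative 64-bit figure from the text;
    it is an assumption, not a known RDNA 3 value.
    """
    if num_chiplets < 1:
        raise ValueError("need at least one chiplet")
    return num_chiplets * bits_per_chiplet


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} chiplet(s): {total_bus_width(n)}-bit bus")
```

The width scales linearly with chiplet count because each die brings its own memory channels, even though only the primary die controls them.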
The most important (and problematic) aspect of multi-GPU configurations has been distributing the workload between the different GPUs and synchronizing them in a way that leads to palpable gains over a single-GPU configuration. AMD aims to solve this with the help of checkerboarding in its chiplet GPUs. Groups of mutually exclusive pixels (having no common pixels) that are spatially adjacent to one another will be rendered by different chiplets, similar to how SFR (Split Frame Rendering) split the frame into two or more sections which were then rendered by different GPUs.
However, in this case, the clustering will be more complex (with smaller checkerboards), preventing screen tearing and other AA-related problems. Each checkerboard (group of pixels) will be treated as a work item (a wave or part of a wave) and sent to different WGPs (Workgroup Processors, also known as Dual Compute Units) for processing and rendering. Furthermore, to make the distribution of the frame more fine-grained, the screen geometry is divided at the mesh level (quads or triangles) in addition to screen space.
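To make the checkerboard idea concrete, here is a minimal sketch of one way to map screen tiles to chiplets. The article does not describe the actual mapping, so the modulo scheme, the tile size, and the function names below are all assumptions chosen for illustration.

```python
from collections import Counter


def chiplet_for_pixel(x: int, y: int, tile_size: int, num_chiplets: int) -> int:
    """Return the chiplet that owns pixel (x, y).

    Hypothetical mapping: the screen is cut into square tiles, and the sum of
    the tile coordinates modulo the chiplet count picks the owner. With two or
    more chiplets, horizontally and vertically adjacent tiles always land on
    different chiplets.
    """
    tile_x = x // tile_size
    tile_y = y // tile_size
    return (tile_x + tile_y) % num_chiplets


def pixels_per_chiplet(width: int, height: int, tile_size: int, num_chiplets: int) -> dict:
    """Count how many pixels each chiplet owns for a frame of the given size."""
    counts = Counter()
    for y in range(height):
        for x in range(width):
            counts[chiplet_for_pixel(x, y, tile_size, num_chiplets)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Every pixel has exactly one owner, so the pixel groups are mutually exclusive.
    print(pixels_per_chiplet(8, 8, tile_size=2, num_chiplets=2))
```

Smaller tiles spread expensive regions of the frame across all chiplets more evenly than the large halves used by classic SFR, at the cost of more work items to schedule.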
As for synchronization between the various chiplets, the command processor in each chiplet identifies viable points in the pipeline and pauses execution there, so that the rest of the chiplets can catch up. This keeps the various chiplets at a similar level of work saturation, ensuring synchronization. Another important thing to keep in mind is that only the primary chiplet will issue waves to the secondary dies and communicate with the CPU, via the bridge chiplet, while also handling synchronization.
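The catch-up behavior can be approximated with a simple barrier model. This is only a rough sketch under stated assumptions: it treats each sync point as a full barrier and uses made-up per-phase costs, neither of which is specified in the article.

```python
def simulate_sync(phase_costs):
    """Simulate chiplets that wait for each other at every sync point.

    phase_costs[p][c] is the (hypothetical) time chiplet c needs for the work
    between sync point p-1 and sync point p. Returns the total elapsed time
    and the idle time each chiplet spent waiting for slower chiplets.
    """
    num_chiplets = len(phase_costs[0])
    clock = 0
    idle = [0] * num_chiplets
    for costs in phase_costs:
        slowest = max(costs)  # everyone waits for the slowest chiplet
        for chiplet, cost in enumerate(costs):
            idle[chiplet] += slowest - cost
        clock += slowest
    return clock, idle


if __name__ == "__main__":
    total, idle = simulate_sync([[3, 5, 4], [2, 2, 6]])
    print(f"total time: {total}, idle per chiplet: {idle}")
```

The idle figures show why balanced work distribution matters: the more evenly the checkerboards are spread, the less time any chiplet spends stalled at a sync point.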
Lastly, in terms of die configuration, we’re likely to see two or three chiplets, as that’s the maximum TSMC’s CoWoS-L packaging technology is capable of at the moment. Furthermore, CoWoS-L technology is also set to enter mass production sometime in late 2021, just in time for the Navi 3x GPUs.