When AMD announced its Radeon RX 6800 series and 6900 XT GPUs, otherwise known as Big Navi, last month, the architectural details were quite sparse. Two features took most of the spotlight: the huge 128MB of L3 cache, which AMD has dubbed the Infinity Cache, and the relatively high core clocks, which together result in a much higher performance-per-watt ratio compared to RDNA 1.

However, it takes more than a good memory subsystem and high shader clocks to design a competitive high-end GPU. That’s what we’ll explore in this post. We’ll be comparing the architectural differences between RDNA 1 and RDNA 2, covering the throughput, render backend and frontend, compute and graphics pipelines, and the revamped memory system. We’ll also have a look at the new Compute Units, which, despite looking the same on the surface, have undergone some significant changes.
The RDNA 2 Compute Units: Up to 30% More Performance at the Same TDP

The Compute Units are largely unchanged, with the exception of the one Ray Accelerator that has been added to each Compute Unit in the RDNA 2 parts. The 32-wide SIMD format introduced with Navi 10 is unchanged, the Scalar Units are still fixed at two per CU, and the L0 cache and Texture Units are the same as in RDNA 1.

Below you can see the raw throughput of the RDNA 1, 1.1, and 2.0 CUs. Now, you’ll likely ask: what’s RDNA 1.1? Well, AMD didn’t say, but I reckon it’s what powers the PS5 and, most likely, the Xbox Series X. Regardless, there’s no real difference between the RDNA 1.1 and 2.0 CU designs; it’s the render backend that has really undergone a makeover with RDNA 2.

The native FP32/FP64 and INT32/INT64 throughput is unchanged across RDNA 1.1 and RDNA 2. It’s the mixed-precision throughput that has doubled, from 128 ops per cycle to 256. Furthermore, RDNA 1 doesn’t support the low-precision modes such as INT8/INT32 and INT4/INT32; both RDNA 1.1 and 2.0 can process 512 ops per cycle for the former and up to 1,024 for the latter.
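To put those per-cycle figures in perspective, here’s a quick back-of-the-envelope calculation for the full chip. The 80-CU count and ~2.25GHz boost clock are my assumptions for the top Navi 21 part (not figures quoted above), and the 128 FP32 ops per cycle per CU follow from the two SIMD32 units each issuing a fused multiply-add per clock.

```python
# Rough whole-GPU throughput from the per-cycle figures above.
# Assumed: 80 CUs and a ~2.25 GHz boost clock for the top Navi 21 SKU.
CUS = 80
CLOCK_GHZ = 2.25

def tera_ops(ops_per_cycle_per_cu):
    """Theoretical peak in tera-operations per second for the whole GPU."""
    return CUS * ops_per_cycle_per_cu * CLOCK_GHZ / 1000

print(f"FP32 (2x SIMD32, FMA = 2 ops): {tera_ops(128):.1f} TFLOPS")
print(f"Mixed precision (256 ops/clk): {tera_ops(256):.1f} TOPS")
print(f"INT8 (512 ops/clk):            {tera_ops(512):.1f} TOPS")
print(f"INT4 (1,024 ops/clk):          {tera_ops(1024):.1f} TOPS")
```

Under those assumptions, the FP32 figure works out to roughly 23 TFLOPS, with the low-precision modes scaling up from there.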
The Render Backend: RB+

Once again, AMD hasn’t provided many details on what has changed with the RDNA 2 RBs (RB+). What we do know is that the throughput has been doubled and that the units have been tuned for VRS and the other features that come with DX12 Ultimate.
Each RB+ can process 8 32-bit pixels per clock, a 2x increase compared to RDNA 1 and 1.1, primarily the result of the doubled 32bpp color rate. The new multi-precision RBs also supply the shader engines at twice the rate, primarily improving performance in mixed-precision workloads such as VRS.
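Taking that 8 pixels per clock figure at face value, a quick fill-rate estimate looks like this; the 16 RB+ count for a full Navi 21 die and the ~2.25GHz clock are my assumptions, not AMD-quoted numbers.

```python
# Peak 32bpp pixel fill rate implied by 8 pixels/clock per RB+.
# Assumed: 16 RB+ units on a full Navi 21 die, ~2.25 GHz boost clock.
RB_PLUS_UNITS = 16
PIXELS_PER_CLOCK = 8
CLOCK_GHZ = 2.25

fill_rate = RB_PLUS_UNITS * PIXELS_PER_CLOCK * CLOCK_GHZ  # Gpixels/s
print(f"Peak 32bpp fill rate: {fill_rate:.0f} Gpixels/s")  # ~288
```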
AMD Infinity Cache: Just Cache It!

The key change, and the main driving factor behind the Big Navi GPUs, is the huge L3 cache that AMD is calling the Infinity Cache. It’s this breakthrough that has allowed AMD to reach its 1.5x perf-per-watt goal, push boost clocks in excess of 2.5GHz, and catch up with NVIDIA despite relying on standard GDDR6 memory and a paltry 256-bit bus.


Although the low-level cache sizes are the same, the L0-to-L1 transfer rate has been halved from 64B/clock to 32B/clock. The L1-to-L2 bandwidth is unchanged at 2,048B/clock for the entire stack, while the L2-to-Infinity-Cache bandwidth is a rather impressive 1,024B/clock.


The use of the Infinity Cache not only reduces the effective memory latency by 34%, it also delivers cache hit rates of up to 58% at 4K. Essentially, the internal cache bandwidth behaves akin to that of a 1,024-bit bus paired with GDDR6 memory, even though the actual VRAM bandwidth is still capped at 512GB/s.
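For a rough sense of what that hit rate buys, here’s a simple model of my own (not AMD’s effective-bandwidth methodology): if 58% of requests at 4K are served from the on-die cache, only the remaining 42% ever have to cross the 256-bit GDDR6 bus.

```python
# Simple (non-AMD) model of effective bandwidth: only the misses have to
# travel over the 256-bit GDDR6 bus, so the bandwidth the shaders see is
# scaled by 1 / (1 - hit_rate), assuming the Infinity Cache itself is
# never the bottleneck.
GDDR6_GBPS_PER_PIN = 16          # implied by 512 GB/s over a 256-bit bus
BUS_WIDTH_BITS = 256
HIT_RATE_4K = 0.58               # AMD's quoted 4K hit rate

vram_bw = GDDR6_GBPS_PER_PIN * BUS_WIDTH_BITS / 8    # 512 GB/s
effective_bw = vram_bw / (1 - HIT_RATE_4K)           # ~1,200 GB/s

print(f"Raw VRAM bandwidth:        {vram_bw:.0f} GB/s")
print(f"Effective bandwidth at 4K: {effective_bw:.0f} GB/s")
```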

The Infinity Cache also improves the ray-tracing performance by storing the bandwidth-intensive BVH structures that tend to choke the limited cache bandwidth of most GPUs.
Ray-Tracing on RDNA 2

Another burning question we all had was with respect to AMD’s ray-tracing implementation. While NVIDIA uses a full-fledged hardware solution for ray-tracing, AMD has a hybrid approach based on a combination of dedicated hardware (called Ray Accelerators) and traditional shaders to implement ray-tracing.
While the RAs handle the ray-box and ray-triangle intersections, the BVH traversal is handled by the SIMDs (shaders) with the help of the Infinity Cache’s high bandwidth. Furthermore, the RAs share resources with the Texture Units, so only one of the two can run in a given cycle. While I haven’t been able to confirm it, I believe that the RAs and SIMDs can’t run concurrently the way they can on NVIDIA’s RTX GPUs. Software and game-engine optimization can help alleviate the performance hit of this drawback, but natively the shader pipelines take a penalty when running ray-tracing code.
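Conceptually, the hybrid pipeline looks something like the sketch below: the traversal loop is ordinary shader code running on the SIMDs, and only the box/triangle tests are handed to the Ray Accelerator. The ra_* helpers are hypothetical stand-ins for that fixed-function hardware, not a real API, and the whole thing is an illustration of the division of labour described above rather than AMD’s actual implementation.

```python
# Conceptual sketch of RDNA 2's hybrid ray tracing (not AMD's actual code).
# The "ra_*" helpers below are hypothetical software stand-ins for the
# Ray Accelerator's fixed-function intersection tests.

def ra_intersect_boxes(ray, children):
    """Stand-in for the Ray Accelerator's ray-box tests."""
    return [c for c in children if c.aabb_hit(ray)]

def ra_intersect_triangles(ray, triangles):
    """Stand-in for the Ray Accelerator's ray-triangle tests."""
    hits = [h for h in (t.intersect(ray) for t in triangles) if h is not None]
    return min(hits, key=lambda h: h.t, default=None)

def trace_ray(ray, bvh_root):
    stack, closest = [bvh_root], None
    while stack:                                   # traversal runs on the SIMDs
        node = stack.pop()
        if node.is_leaf:
            hit = ra_intersect_triangles(ray, node.triangles)    # RA work
            if hit and (closest is None or hit.t < closest.t):
                closest = hit
        else:
            stack.extend(ra_intersect_boxes(ray, node.children)) # RA work
    return closest
```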
Mesh Shading and Other DX12 Ultimate Features

A difference I noticed between AMD’s and NVIDIA’s implementations of mesh shading is that while the former uses the native DX12 Ultimate solution, NVIDIA has a more flexible approach which essentially allows for fewer draw calls with more work allocated per draw call. AMD’s solution primarily optimizes workloads through early culling of primitives and data reuse.


NVIDIA also has an option that allows for early culling of primitives, but it’s separate from the core Mesh Shader and is called the Amplification Shader. Of course, the final decision will fall on the developers and how they decide to leverage the feature.
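To make the idea concrete, here’s a conceptual sketch of the mesh-shading flow; the names and structure are purely illustrative and don’t correspond to any real graphics API. The amplification stage culls whole meshlets up front, so the mesh stage (and everything downstream) only ever sees the survivors, all within a single draw.

```python
# Conceptual mesh-shading flow (illustrative only, not a real API):
# an amplification stage culls whole meshlets early, and only the
# survivors get mesh-shader work within the same draw call.

def amplification_stage(meshlets, camera):
    """Early culling: discard meshlets whose bounds fall outside the view."""
    return [m for m in meshlets if camera.frustum_contains(m.bounds)]

def mesh_stage(meshlet):
    """Each surviving meshlet emits its own small batch of vertices and triangles."""
    return meshlet.build_vertices(), meshlet.build_triangles()

def draw_scene(meshlets, camera):
    visible = amplification_stage(meshlets, camera)   # one dispatch culls everything
    return [mesh_stage(m) for m in visible]           # no per-object draw calls
```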


Variable Rate Shading and Sampler Feedback are primarily API-level features rather than hardware ones, and as such they work in a similar way across all GPU architectures.
Conclusion

Overall, most of the performance advances with RDNA 2 come from the fine-tuned cache hierarchy, lower latency, better hit rates, and the high-frequency boost clocks. The primary change on the architecture side concerns the RBs, which have had their throughput effectively doubled. The basic structure of the CUs, schedulers, and command processors is more or less identical.
