With the launch of the Big Navi graphics cards, AMD has finally returned to the high-end GPU space with a bang. While the RDNA 2 design is largely similar to RDNA 1 in terms of the compute and graphics pipelines, there are some changes that have allowed the inclusion of the Infinity Cache and the high boost clocks. In this post, we’ll talk about how the Navi GPUs are different from the traditional Vega and Polaris parts which were powered by the GCN architecture.
AMD’s GCN architecture powered Radeon graphics cards for almost a decade. Although the design had its strengths such as a powerful Compute Engine, hardware schedulers, and unified memory, it wasn’t very efficient for gaming. The hardware utilization was quite poor compared to contemporary NVIDIA parts, scaling dropped sharply after the first 11 CUs per shader engine, and overall, using more than 64 CUs per GPU wasn’t feasible.
As a result, despite featuring a powerful compute architecture, AMD’s GCN GPUs (Vega) repeatedly lost to NVIDIA’s high-end gaming products, all the while drawing significantly higher power.
The 1st Gen RDNA architecture powering the Navi 10 and Navi 14 GPUs (Radeon RX 5500 XT, 5600 XT, and 5700/XT) is based on the same building blocks as GCN: a vector processor with a few dedicated scalar units for address calculation and control flow, plus separate compute and graphics pipelines running asynchronously. ALUs called stream processors provide the computational power, and the Command Processor (along with the ACEs) handles the workload scheduling per Compute Unit.
The core difference is that RDNA reorganizes the fundamental components of GCN for a higher IPC, lower latency, and better efficiency. That’s what Navi is all about: It does a lot more with notably less hardware!
AMD GCN: Powerful but Underutilized
AMD’s GCN graphics architecture packed 64 ALUs (stream processors, or cores) into each Compute Unit and scheduled work in wavefronts of 64 work-items. The ALUs were divided into four SIMD (Single Instruction, Multiple Data) units, each packing 16 ALUs (SP).
Here’s where most people get confused. Yes, it’s true that the scheduler could issue a new wavefront instruction to a given SIMD only once every four cycles, but each Compute Unit worked on four 64-item waves at a time, not one. Like Bulldozer, the aim was to maximize parallelization. At the same time, GCN wasn’t an out-of-order architecture. The instructions within a wavefront were still executed in program order. The difference was that the CU or SIMDs could switch to any of the four available waves.
The reason this wasn’t very effective is that most games use shorter work queues, so only one or two of the four wavefronts were saturated per execution cycle. As a result, the competing NVIDIA GPUs with similar shader counts were much faster thanks to their Super-Scalar architecture and took only one to two cycles to execute these shorter dispatches. The AMD counterparts, on the other hand, had to wait four cycles for the next one despite having room for additional wavefronts.
To sum up, like many other SIMD designs, a GCN Compute Unit worked on four wavefronts at a time and took four cycles to execute them. In an ideal world, this would mean that the effective time taken for one wave is one cycle. But because of the way SIMDs work, this wasn’t the case, and the CUs were often left underutilized.
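The utilization argument above can be put into numbers with a toy model. This is only a sketch under simplifying assumptions: it treats each SIMD as holding at most one wavefront, ignores memory latency and wave switching, and the function name is made up for illustration.

```python
# Toy model of a GCN Compute Unit, using the figures from the text:
# four SIMDs, each needing four cycles per wave64 instruction (64 / 16).
SIMDS_PER_CU = 4
CYCLES_PER_WAVE_INSTRUCTION = 4


def waves_retired_per_cycle(active_waves: int) -> float:
    """Wave instructions completed per cycle by one CU (hypothetical helper).

    Assumes one wavefront per SIMD, so extra waves beyond four add nothing.
    """
    busy_simds = min(active_waves, SIMDS_PER_CU)
    return busy_simds / CYCLES_PER_WAVE_INSTRUCTION


for waves in (1, 2, 4):
    rate = waves_retired_per_cycle(waves)
    print(f"{waves} active wave(s): {rate:.2f} wave instructions per cycle")
```

With all four SIMDs busy the model reaches the ideal one wave instruction per cycle; with only one or two waves in flight, as the text describes for many games, it drops to a quarter or half of that.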
AMD RDNA: Dual Compute Architecture and Wave32
The RDNA architecture implemented in Navi uses wave32, a narrower wavefront with 32 work-items. It is simpler and more efficient than the older wave64 design. Each SIMD is wider, but the wavefronts are narrower.
Where the Compute Unit was the basic shader unit in GCN, RDNA replaces it with a Workgroup Processor (WGP), also called a Dual Compute Unit: two CUs working in tandem with shared local data. The RDNA SIMDs consist of 32 shaders or ALUs, twice that of GCN. There are two SIMDs per CU and four in a Dual Compute Unit. The total number of stream processors in a CU is still 64, but they are distributed across two (not four) wider SIMDs. Refer to the "Four cycles vs one cycle per instruction" diagram further below.
This arrangement allows a whole wave32 wavefront to execute in one clock cycle instead of four, reducing bottlenecks and boosting per-wavefront instruction throughput by up to 4x. By completing a wavefront 4x faster, the registers and cache are freed up much sooner, allowing the scheduling of more instructions overall. Furthermore, wave32 uses half the number of registers as wave64, reducing circuit complexity and costs too.
To accommodate the narrower wavefronts, the vector register file has also been reorganized. Each vector general-purpose register (vGPR) now contains 32 lanes that are 32 bits wide (for FP32), and a SIMD contains a total of 1,024 vGPRs, again 4x the number of registers in GCN.
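The four-cycles-versus-one comparison comes down to dividing the wavefront width by the SIMD width. The snippet below is a minimal sketch of that arithmetic; the helper name and the configuration labels are illustrative, and the model deliberately ignores everything except lane counts.

```python
from math import ceil


def issue_cycles(wave_size: int, simd_width: int) -> int:
    """Cycles a SIMD needs to run one instruction across a whole wavefront."""
    return ceil(wave_size / simd_width)


# (wavefront size, SIMD width) pairs taken from the figures in the text.
CONFIGS = {
    "GCN wave64 on 16-lane SIMD": (64, 16),
    "RDNA wave32 on 32-lane SIMD": (32, 32),
    "RDNA wave64 on 32-lane SIMD": (64, 32),
}

for label, (wave_size, simd_width) in CONFIGS.items():
    print(f"{label}: {issue_cycles(wave_size, simd_width)} cycle(s)")
```

Under this model a GCN wave64 instruction takes four cycles, an RDNA wave32 instruction takes one, and running wave64 on RDNA hardware takes two passes.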
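As a quick sanity check on the register file figures, the capacity per SIMD can be worked out from the lane count and register count. The sketch below assumes 256 vGPRs of 64 lanes each for a GCN SIMD, which is what the "4X" comparison implies; the function name is hypothetical.

```python
def vgpr_file_bytes(num_vgprs: int, lanes: int, bytes_per_lane: int = 4) -> int:
    """Total vector register file size for one SIMD, in bytes."""
    return num_vgprs * lanes * bytes_per_lane


# RDNA: 1,024 vGPRs x 32 lanes x 32 bits (figures from the text).
rdna_bytes = vgpr_file_bytes(num_vgprs=1024, lanes=32)

# GCN: assumed 256 vGPRs x 64 lanes x 32 bits (one quarter of RDNA's count).
gcn_bytes = vgpr_file_bytes(num_vgprs=256, lanes=64)

print(f"RDNA SIMD: {rdna_bytes // 1024} KiB")  # 128 KiB
print(f"GCN SIMD:  {gcn_bytes // 1024} KiB")   # 64 KiB
```

Note that four times as many registers at half the width works out to twice the capacity per SIMD, not four times.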
Overall, the narrower wave32 mode increases the throughput by improving the IPC and the total number of concurrent wavefronts, resulting in significant performance and efficiency boosts.
To ensure compatibility with the older GCN instruction set, the RDNA SIMDs in Navi support mixed-precision compute. This makes the new Navi GPUs suitable not only for gaming workloads (FP32), but also for scientific (FP64) and AI (FP16) applications. The RDNA SIMD improves latency by 2x in wave32 mode and by 44% in wave64 mode.
Asynchronous Compute Tunneling
One of the main highlights of the GCN architecture was its use of Asynchronous Compute Engines long before NVIDIA integrated them into their graphics cards. RDNA retains that capability and doubles down on it.
The Command Processor handles the commands from the API and then issues them to the respective pipelines: the Graphics Command Processor (GCP) manages the graphics pipeline (shaders and fixed-function hardware) while the Asynchronous Compute Engines (ACEs) take care of compute. The Navi 10 die (RX 5700 XT) has one Graphics Command Processor and four ACEs. Each ACE has a distinct command stream, while the GCP has individual streams for every shader type (domain, vertex, pixel, raster, etc.).
The RDNA architecture improves parallel processing at an instruction level by introducing a new feature called Asynchronous Compute Tunneling. Both GCN and the newer Navi GPUs support asynchronous compute (simultaneous execution of graphics and compute pipelines), but RDNA takes a step further. At times when one task (graphics or compute) becomes far more latency-sensitive than the other, Navi has the ability to completely suspend the latter.
In GCN-based Vega designs, the command processor could prioritize compute over graphics and spend less time on shaders. In the RDNA architecture, the GPU can completely suspend the graphics pipelines, using all the resources for high-priority compute tasks. This significantly improves performance in the most latency-sensitive workloads such as virtual reality.
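The difference between plain asynchronous compute and tunneling can be illustrated with a toy scheduler. This is purely a conceptual sketch and not how the hardware or driver is implemented: the Task class, the schedule function, and the task names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    kind: str                  # "graphics" or "compute"
    high_priority: bool = False


def schedule(tasks: list[Task], tunneling: bool) -> list[str]:
    """Return the names of tasks allowed to run in this step (toy model).

    Without tunneling, graphics and compute tasks run side by side.
    With tunneling, any high-priority compute task suspends everything else.
    """
    urgent = [t for t in tasks if t.kind == "compute" and t.high_priority]
    if tunneling and urgent:
        return [t.name for t in urgent]
    return [t.name for t in tasks]


pending = [
    Task("shadow_pass", "graphics"),
    Task("vr_reprojection", "compute", high_priority=True),
]

print(schedule(pending, tunneling=False))  # both tasks share the GPU
print(schedule(pending, tunneling=True))   # only the urgent compute task runs
```

In the second call the graphics task is held back entirely until the latency-sensitive compute work is done, which is the behavior the text attributes to RDNA.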
Scalar Execution for Control Flow