AMD’s GCN architecture powered Radeon graphics cards for almost a decade. Although the design had its strengths, such as a powerful compute engine, hardware schedulers, and unified memory, it wasn’t very efficient. Hardware utilization was poor compared to rival NVIDIA parts, scaling dropped sharply beyond the first 11 CUs per shader engine, and more than 64 CUs per GPU wasn’t feasible.
In gaming especially, the IPC and latency were below the industry standard, which is why AMD GPUs repeatedly failed to keep up with NVIDIA’s high-end products. The RDNA-based Navi GPUs aim to rectify the drawbacks of GCN. The Radeon RX 5000 series graphics cards are built from the ground up for gaming and feature a much more refined pipeline.
RDNA is the core architecture and Navi is the codename of the graphics processors built using it. Similarly, GCN was the architecture, and Vega and Polaris were the codenames of the GPUs built on it.
The 1st Gen RDNA architecture powering the Navi 10 and Navi 14 GPUs (Radeon RX 5500, 5600 XT and 5700/XT) is based on the same building blocks as GCN: a vector processor with a few dedicated scalar units for address calculation and control flow, and separate compute and graphics pipelines running asynchronously. Vector ALUs, called stream processors, provide the computational power, and the Command Processor (along with the ACEs) handles workload scheduling per Compute Unit. The core difference is that RDNA reorganizes the fundamental components of GCN for higher IPC, lower latency and better efficiency. That’s what Navi is all about: it does a lot more with notably less hardware!
Dual Compute Architecture and Wave32
AMD’s GCN graphics architecture executed wavefronts of 64 work-items, with 64 ALUs (stream processors) per Compute Unit. These were divided into four SIMDs (Single Instruction, Multiple Data units), each packing 16 ALUs. Every SIMD took four clock cycles to complete one full wavefront. When all the work-items execute the same instruction, this is very efficient, as they finish their workloads simultaneously. However, unlike CPUs, GPUs have to process diverse workloads.
Often, one wave consisted of multiple kinds of work-items: some of them required four clock cycles to execute, while many needed just one or two. This left much of the SIMD underutilized, making it hard to fully saturate.
The RDNA architecture implemented in Navi uses wave32, a narrower wavefront with 32 work-items. It is simpler and more efficient than the older wave64 design. Each SIMD consists of 32 shaders or ALUs, twice that of GCN. There are two SIMDs per CU and four in a Dual Compute Unit. The total number of stream processors in a CU is still 64, but they are distributed across two (not four) wider SIMDs. Refer to the four cycles vs one cycle per instruction diagram further below.
This arrangement allows the execution of one whole wavefront in a single clock cycle, reducing bottlenecks and boosting IPC by 2x. By completing a wavefront 4x faster, the registers and cache are freed up much sooner, allowing quicker scheduling between consecutive instructions. Furthermore, wave32 uses half the number of registers as wave64, reducing circuit complexity and costs too.
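The cycle counts above follow from simple arithmetic: a wavefront on a narrower SIMD must be issued over multiple cycles. A minimal sketch (a toy model, not a hardware simulator, using only the widths quoted in the text):

```python
# Toy model: cycles for one SIMD to issue a single instruction for a
# full wavefront, in GCN's wave64 mode vs RDNA's wave32 mode.

def cycles_per_wavefront(wave_size: int, simd_width: int) -> int:
    """An N-item wavefront on a W-lane SIMD needs ceil(N / W) cycles."""
    return -(-wave_size // simd_width)  # ceiling division

# GCN: 64 work-items on a 16-lane SIMD -> 4 cycles per instruction
gcn = cycles_per_wavefront(wave_size=64, simd_width=16)

# RDNA: 32 work-items on a 32-lane SIMD -> 1 cycle per instruction
rdna = cycles_per_wavefront(wave_size=32, simd_width=32)

print(gcn, rdna)  # 4 1
```

The same formula also shows why RDNA's backward-compatible wave64 mode still takes two cycles on a 32-lane SIMD.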
To accommodate the narrower wavefronts, the vector register file has also been reorganized. Each vector general-purpose register (vGPR) now contains 32 lanes that are 32 bits wide, and a SIMD contains a total of 1,024 vGPRs – again, 4x the number of registers as in GCN.
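As a back-of-the-envelope check of those figures (a sketch using only the numbers quoted above), the per-SIMD vector register file works out to 128KB:

```python
# Size of the vector register file per RDNA SIMD, from the figures in
# the text: 1,024 vGPRs, each with 32 lanes that are 32 bits wide.
def vgpr_file_bytes(num_vgprs=1024, lanes=32, bits_per_lane=32):
    return num_vgprs * lanes * bits_per_lane // 8

print(vgpr_file_bytes() // 1024, "KB")  # 128 KB per SIMD
```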
Overall, the narrower wave32 mode increases the throughput by improving the IPC and the total number of concurrent wavefronts, resulting in significant performance and efficiency boosts.
To ensure compatibility with the older GCN instruction set, the RDNA SIMDs in Navi support mixed-precision compute. This makes the new Navi GPUs suitable not only for gaming workloads (FP32), but also for scientific (FP64) and AI (FP16) applications. The RDNA SIMD improves latency by 2x in wave32 mode and by 44% in wave64 mode.
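The storage side of mixed-precision compute can be illustrated in software: two IEEE-754 half-precision values fit in one 32-bit register slot. A small sketch (illustrative only; the GPU does this in hardware):

```python
import struct

# Illustrative packing of two FP16 values into one 32-bit word, the
# storage layout behind packed 16-bit mixed-precision data.
def pack_half2(a: float, b: float) -> bytes:
    return struct.pack('<2e', a, b)    # two IEEE-754 half floats, 4 bytes

def unpack_half2(word: bytes) -> tuple:
    return struct.unpack('<2e', word)

word = pack_half2(1.5, -2.0)           # both values exact in FP16
assert len(word) == 4                  # fits in one 32-bit register
print(unpack_half2(word))              # (1.5, -2.0)
```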
Asynchronous Compute Tunneling
One of the main highlights of the GCN architecture was its use of Asynchronous Compute Engines, well before NVIDIA integrated similar functionality into their graphics cards. RDNA retains that capability and doubles down on it.
The Command Processor handles the commands from the API and then issues them to the respective pipelines: the Graphics Command Processor manages the graphics pipeline (shaders and fixed-function hardware) while the Asynchronous Compute Engines (ACEs) take care of compute. The Navi 10 die (RX 5700 XT) has one Graphics Command Processor and four ACEs. Each ACE has a distinct command stream, while the GCP has individual streams for every shader type (domain, vertex, pixel, raster, etc.).
The RDNA architecture improves parallel processing at an instruction level by introducing a new feature called Asynchronous Compute Tunneling. Both GCN and the newer Navi GPUs support asynchronous compute (simultaneous execution of the graphics and compute pipelines), but RDNA goes a step further: when one task (graphics or compute) becomes far more latency-sensitive than the other, Navi has the ability to completely suspend the less critical one.
In GCN-based Vega designs, the command processor could prioritize compute over graphics and spend less time on shaders. In the RDNA architecture, the GPU can completely suspend the graphics pipeline, devoting all resources to high-priority compute tasks. This significantly improves performance in the most latency-sensitive workloads, such as virtual reality.
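The difference between prioritizing and fully suspending can be sketched with a toy scheduler (purely illustrative; queue names and the interleaving policy are my assumptions, not AMD's actual arbitration logic):

```python
from collections import deque

# Toy contrast: GCN-style prioritization still grants graphics some
# slots, while RDNA-style tunneling suspends graphics entirely until
# the high-priority compute queue drains.
def run(graphics, compute, tunneling: bool):
    g, c = deque(graphics), deque(compute)
    order = []
    while g or c:
        if c:
            order.append(c.popleft())      # compute is high priority
            if g and not tunneling:
                order.append(g.popleft())  # GCN: graphics interleaves
        else:
            order.append(g.popleft())      # compute drained: resume graphics
    return order

gfx, cmp_ = ["g0", "g1", "g2"], ["c0", "c1", "c2"]
print(run(gfx, cmp_, tunneling=False))  # ['c0', 'g0', 'c1', 'g1', 'c2', 'g2']
print(run(gfx, cmp_, tunneling=True))   # ['c0', 'c1', 'c2', 'g0', 'g1', 'g2']
```

In the tunneling case, all compute work completes as early as possible, which is exactly what a latency-critical VR timewarp pass needs.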
Scalar Execution for Control Flow
Most of the computation in AMD’s GCN and RDNA architectures is performed by the SIMDs, which are vector units: they execute a single instruction across multiple data items (32 INT32/FP32 operations per SIMD per cycle, simultaneously). However, there are scalar units in each CU as well.
Each SIMD contains a 10KB scalar register file, with 128 entries for each of the 20 wavefronts. A register is 32 bits wide and can hold packed 16-bit data (integer or floating-point), while adjacent register pairs hold 64-bit data. The scalar units are used for address generation for the load/store units and for managing SIMD control flow.
When a wavefront is initiated, the scalar register file can preload up to 32 user registers to pass constants, avoiding explicit load instructions and reducing the launch time for wavefronts.
The 16KB write-back scalar cache is 4-way set-associative and built from two banks of 128 cache lines, each 64 bytes. Each bank can read a full cache line, and the cache can deliver 16B per clock to the scalar register file in each SIMD. For graphics shaders, the scalar cache is commonly used to store constants and work-item-independent variables.
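The scalar-side figures quoted above are internally consistent, which a quick sketch can verify (using only the numbers in the text):

```python
# Sanity checks on the scalar register file and scalar cache sizes.

def scalar_regfile_bytes(wavefronts=20, entries_per_wave=128, reg_bits=32):
    # 20 wavefronts x 128 entries x 4 bytes = 10KB
    return wavefronts * entries_per_wave * reg_bits // 8

def scalar_cache_bytes(banks=2, lines_per_bank=128, line_bytes=64):
    # 2 banks x 128 lines x 64B = 16KB
    return banks * lines_per_bank * line_bytes

assert scalar_regfile_bytes() == 10 * 1024
assert scalar_cache_bytes() == 16 * 1024

# A 16KB, 4-way cache with 64B lines has 16KB / (4 * 64B) sets:
print(scalar_cache_bytes() // (4 * 64))  # 64 sets
```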
Cache: L0 & Shared L1
While the older GCN and rival NVIDIA GPUs rely on two levels of cache, RDNA adds a third level in the Navi GPUs. Where the L0 cache is private to a CU, the new L1 cache is shared across a group of Dual Compute Units. This reduces costs, latency and power consumption, and it also reduces the load on the L2 cache: in GCN, all misses of the per-CU L1 cache were handled by the L2, whereas in RDNA the new L1 cache centralizes all caching functions within each shader array.
Any misses in the L0 caches pass to the L1 cache. This includes all the data from the instruction, scalar and vector caches, in addition to the pixel cache. The L1 is a read-only, 16-way set-associative cache composed of four banks, for a total of 128KB. It is backed by the L2: since the L1 is read-only, a write invalidates the matching L1 line and is forwarded to the L2 or memory.
The L1 cache controller coordinates memory requests and forwards four per clock cycle, one to each L1 bank. Like in any other cache memory, L1 misses are serviced by the L2 cache.
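The read path through this hierarchy can be sketched schematically (dictionaries stand in for the cache arrays; the addresses and data are made up, and real caches work on lines and tags, not single keys):

```python
# Schematic read path for the L0 -> shared L1 -> L2 hierarchy described
# above: an L0 miss falls through to the read-only L1, and an L1 miss
# is serviced by the L2.
def read(addr, l0: dict, l1: dict, l2: dict):
    if addr in l0:
        return l0[addr], "L0 hit"
    if addr in l1:
        l0[addr] = l1[addr]              # fill the private L0
        return l1[addr], "L1 hit"
    data = l2[addr]                      # L1 miss serviced by L2
    l1[addr] = data                      # fill the shared, read-only L1
    l0[addr] = data
    return data, "L2 hit"

l0, l1, l2 = {}, {}, {0x40: "texel"}
print(read(0x40, l0, l1, l2))  # ('texel', 'L2 hit') -- first touch
print(read(0x40, l0, l1, l2))  # ('texel', 'L0 hit') -- now cached
```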
Dual Compute Unit Front End
Each Compute Unit fetches instructions via the instruction memory fetch unit. In GCN, the instruction cache was shared between four CUs, but in RDNA (Navi), the L0 instruction cache is shared amongst the four SIMDs in a Dual CU. The instruction cache is 32KB and 4-way set-associative. Like the L1 cache, it is organized into four banks of 128 cache lines, each 64 bytes long.
The fetched instructions are deposited into the wavefront controllers. Each SIMD has a separate instruction pointer and a 20-entry wavefront controller, for a total of 80 wavefronts per Dual Compute Unit. A wavefront is distinct from a work-group or kernel: although a higher number of wavefronts may be fetched, a dual compute unit runs only 32 work-groups simultaneously.
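The bookkeeping above reduces to simple multiplication, sketched here with the figures from the text:

```python
# Wavefront capacity of one dual compute unit:
# 4 SIMDs x 20 wavefront slots = 80 wavefronts tracked in flight.
def wavefronts_per_dual_cu(simds=4, slots_per_simd=20):
    return simds * slots_per_simd

def work_items_in_flight(wave_size=32):
    # 80 wave32 wavefronts cover 2,560 work-items
    return wavefronts_per_dual_cu() * wave_size

print(wavefronts_per_dual_cu())  # 80
print(work_items_in_flight())    # 2560
```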
As already mentioned, where GCN requested instructions once every four cycles, Navi does it every cycle (2-4 instructions per cycle). Each SIMD in an RDNA-based Navi GPU can then decode and issue instructions every cycle as well, increasing throughput and reducing latency by 4x over GCN.
To accommodate the new wave32 mode, the cache and memory pipeline in each RDNA SIMD has also been revamped. The pipeline width has been doubled compared to GCN-based Vega GPUs. Every SIMD has a 32-wide request bus that can transmit the address for each work-item in a wavefront directly to the ALUs or the vGPRs (Vector General Purpose Registers).
A pair of SIMDs shares a request and return bus; however, a single SIMD can receive two chunks of 128-byte cache lines per clock: one from the LDS (Local Data Share) and the other from the vector L0 cache.
Video Encode and Decode
Like NVIDIA’s Turing encoder, the Navi GPUs also feature a specialized engine for video encoding and decoding.
In Navi 10 (RX 5600 & 5700), unlike Vega, the video engine supports VP9 decoding. H.264 streams can be decoded at 600 frames/sec for 1080p and at 150 fps for 4K. It can simultaneously encode at about half that speed: 1080p at 360 fps and 4K at 90 fps. 8K decode is available at 24 fps for both HEVC and VP9.
7nm Process and GDDR6 Memory Standard
While the 7nm node and GDDR6 memory are often advertised as part of the new architecture, these are third-party technologies and aren’t exactly part of the RDNA micro-architecture; the GPUs are simply designed to work with them.
TSMC’s 7nm node does, however, improve the performance per watt significantly over the older 14nm process powering the older GCN designs, namely Polaris and Vega. It increases the performance per area by 2.3x and the performance per watt metric is boosted by 1.5x.
As you can see, RDNA and Navi don’t exactly reinvent the Radeon design; they mainly refine it. The pipeline bottlenecks have been removed, latency has been reduced, and every SIMD is now wider and faster. There are more Render Backends per Shader Engine, along with three levels of unified cache, a major step up from the preceding Vega GPUs. It’ll be interesting to see how different RDNA 2 will be from the existing Navi GPUs. To be honest, I don’t expect radical changes: there may be dedicated cores for ray-tracing acceleration or upscaling, and that’s about it. What AMD really needs to work on is its software and drivers.