One of the key highlights of the RDNA 3 graphics architecture is the ability to dual-issue Wave32 instructions for twice the floating-point (more specifically, FMA) throughput. However, from what we’ve gathered, this feature may have been overhyped by AMD’s marketing team. Each RDNA 3 Compute Unit features 64 multi-precision/multi-purpose ALUs distributed across two SIMD32 units, alongside a vector matrix accelerator and a SIMD8 unit.
One of the SIMD32 units is capable of both INT and FP compute, in addition to Matrix, while the other can only process FP and Matrix instructions. The pair of SIMD32 vector units can execute one Wave64 FMA or two Wave32 instruction groups in a single clock cycle.
However, this is the absolute peak throughput, possible only on paper. In Wave32 mode, the two 32-wide FMA instructions have access to only one operand register (vGPR), instead of two and an intermediate shared value. Even in Wave64 mode, the achievable peak is only five-sixths of the theoretical figure.
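To put the 5/6 figure in perspective, here is a minimal back-of-the-envelope sketch. It assumes a simple issue-slot model of our own (not AMD's): a pair of instructions occupies one issue slot, and an unpaired instruction occupies one slot by itself. The function names are hypothetical and exist only for this illustration.

```python
from fractions import Fraction

def dual_issue_speedup(paired_fraction):
    """Speedup over single issue when `paired_fraction` of all instructions
    are issued two at a time (assumed model: one slot per pair)."""
    p = Fraction(paired_fraction)
    if not 0 <= p <= 1:
        raise ValueError("paired_fraction must be between 0 and 1")
    # n instructions need n * (1 - p/2) issue slots instead of n.
    return 1 / (1 - p / 2)

def paired_fraction_for(speedup):
    """Inverse of dual_issue_speedup: share of instructions that must pair."""
    s = Fraction(speedup)
    if not 1 <= s <= 2:
        raise ValueError("speedup must be between 1 and 2")
    return 2 * (1 - 1 / s)

# 5/6 of the theoretical 2x peak is a 5/3 speedup over single issue.
target = Fraction(5, 6) * 2
needed = paired_fraction_for(target)  # Fraction(4, 5)
```

Under this assumed model, reaching five-sixths of the 2x peak would already require four out of every five instructions to be issued as part of a pair, which is a high bar for real shader code.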
We reached out to AMD on the matter and got the following response:
"Wave64 natively can access the new ALUs for 2x execution rates to unlock performance during dense ALU code execution. For Wave32 mode, the compiler does localized reordering and packing of instructions into the VOPD encoding. An RT test scene using VOPD encodings provided approximately a 4% increase in frames per second by removing the ALU bottleneck. We expect to see further improvements as the compiler matures with more optimizations for mapping code sequences to the VOPD encodings. And with the advances happening in the use of AI, RT, and compute-driven rendering techniques for more life-like rendering, we expect to see codes bound by ALU that will exploit these new ALUs more and more." - AMD
AMD admits that the 64 ALUs in a Compute Unit can only double the execution throughput in Wave64 mode, and only during dense ALU code execution. In Wave32 mode, the compiler handles the localized reordering and packing of instructions into the VOPD encoding. Even so, in AMD's own ray-traced test scene, VOPD encoding delivered only an approximately 4% increase in frame rate by removing the ALU bottleneck.
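The "localized reordering and packing" AMD describes can be pictured with a toy model. The sketch below is not AMD's compiler and does not model the real VOPD encoding rules: the `Instr` type, the `PAIRABLE_OPS` allow-list, the `window` size, and the conservative dependency check are all assumptions made for illustration. It only shows the general idea of looking a few instructions ahead for an independent partner to issue alongside the current one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    op: str       # mnemonic, e.g. "v_mul_f32"
    dst: str      # destination register name
    srcs: tuple   # source register names

# Hypothetical allow-list; the real set of pairable opcodes is defined by the ISA.
PAIRABLE_OPS = {"v_fmac_f32", "v_mul_f32", "v_add_f32", "v_mov_b32"}

def independent(a, b):
    """Conservative check: b neither reads nor overwrites a's result,
    and does not overwrite any register that a reads."""
    return b.dst != a.dst and a.dst not in b.srcs and b.dst not in a.srcs

def pack(instrs, window=4):
    """Greedily bundle each instruction with the first eligible partner
    found within `window` following instructions (assumed heuristic)."""
    pending = list(instrs)
    bundles = []
    while pending:
        a = pending.pop(0)
        partner = None
        if a.op in PAIRABLE_OPS:
            skipped = [a]  # everything the partner would be hoisted across
            for j, b in enumerate(pending[:window]):
                if b.op in PAIRABLE_OPS and all(independent(x, b) for x in skipped):
                    partner = j
                    break
                skipped.append(b)
        if partner is None:
            bundles.append((a,))
        else:
            bundles.append((a, pending.pop(partner)))
    return bundles

program = [
    Instr("v_mul_f32", "v0", ("v1", "v2")),
    Instr("v_add_f32", "v3", ("v0", "v4")),  # needs v0, cannot pair with the first
    Instr("v_mul_f32", "v5", ("v6", "v7")),  # independent, can be hoisted
    Instr("v_add_f32", "v8", ("v3", "v5")),  # needs v3 and v5
]
bundles = pack(program)
```

In this example the two multiplies end up in one bundle while the two dependent adds issue alone, so four instructions take three issue slots rather than two. Dependency chains like this are one reason dual issue rarely reaches its paper peak.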
Team Red expects the gains from VOPD encodings to improve over time as the compiler matures with more and more optimizations for mapping code sequences to them.