What is the difference between NVIDIA and AMD graphics cards? This is one of the most commonly asked questions. In this post, we look at the latest GeForce and Radeon architectures and compare the two: the similarities, the differences and everything in between. AMD’s latest microarchitecture is RDNA 1.0, used in the Navi 10 based RX 5600 and 5700 series GPUs. NVIDIA’s newest design is the Turing microarchitecture. At a high level, both GPUs share the same basic design, but the finer details reveal vastly different approaches to performing the same task.
AMD Radeon vs NVIDIA GeForce: Introduction
If you look at NVIDIA and AMD’s GPU architectures from a high level, they both consist of the same components and perform more or less the same operations. You’ve got the GPU containing execution units, fed by the schedulers and dispatchers. Then there is the cache memory connecting the GPU to the graphics memory. Finally, the Texture Units, Render Output Units and Rasterizers perform the last set of operations before the data is sent to the display.
Take a closer look at the execution units, the cache hierarchy, and the graphics pipelines, though, and that’s where everything becomes complicated.
AMD Navi vs NVIDIA Turing GPU Architectures: SM vs CU
One of the main differences between NVIDIA and AMD’s GPU architectures lies in the cores (shaders) and the blocks they are grouped into. AMD groups its shaders into Compute Units (CUs), while NVIDIA’s equivalent is the SM, or Streaming Multiprocessor. NVIDIA’s shaders (execution units) are called CUDA cores, while AMD calls them stream processors.
NVIDIA’s SM is split into four processing blocks, each with FP32 cores, INT32 cores, and two Tensor Cores, in addition to load/store units, special function units, and a warp scheduler with its dispatch unit. Because separate cores handle INT and FP, NVIDIA’s SMs can execute floating-point and integer instructions concurrently in the same cycle.
AMD’s Dual CUs, on the other hand, consist of four SIMDs, each containing 32 shaders or execution lanes. There are no separate shaders for INT or FP, and as a result, the Navi stream processors can run either FP or INT per cycle. However, unlike the older GCN design, the execution happens every cycle, greatly increasing the throughput.
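To make the INT/FP difference concrete, here is a deliberately crude sketch of how many issue cycles a mixed instruction stream needs with and without a separate integer pipe. The function name, the one-instruction-per-cycle-per-pipe rule and the 100:36 instruction mix are illustrative assumptions, not vendor figures; the model ignores dependencies, memory stalls and scheduling limits, so treat it as a best-case picture rather than a performance prediction.

```python
def issue_cycles(fp_ops: int, int_ops: int, separate_int_pipe: bool) -> int:
    """Cycles to issue a mix of FP and INT instructions in a toy model.

    Assumption: each pipe accepts one instruction per cycle and the
    instructions are fully independent of each other.
    """
    if separate_int_pipe:
        # FP and INT instructions can overlap, so the longer stream dominates.
        return max(fp_ops, int_ops)
    # One shared pipe: every instruction takes its own issue slot.
    return fp_ops + int_ops


# Hypothetical instruction mix, chosen only for illustration.
fp_ops, int_ops = 100, 36

shared = issue_cycles(fp_ops, int_ops, separate_int_pipe=False)
split = issue_cycles(fp_ops, int_ops, separate_int_pipe=True)

print(f"shared FP/INT pipe: {shared} cycles")
print(f"separate INT pipe:  {split} cycles")
```

In this idealized case the separate integer pipe hides all of the integer work behind the floating-point work. Real shaders sit somewhere between the two results.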
Turing vs Navi: Graphics and Compute Pipeline
In AMD’s Navi architecture, the Graphics Command Processor takes care of the regular graphics pipeline (rendering, pixel, vertex, and hull shaders), while the ACEs (Asynchronous Compute Engines) issue compute tasks via separate pipelines. These work along with the HWS (Hardware Schedulers) and the DMA (Direct Memory Access) engines to allow concurrent execution of compute and graphics workloads.
In Turing, the warp scheduler, along with the Gigathread Engine, manages both compute and graphics workloads. It can execute both workloads simultaneously thanks to the separate cores for INT and FP.
In Team Red’s case, the Graphics Command Processor issues graphics commands while the ACE handles compute workloads. These are issued concurrently in the form of groups of threads called waves. Each wave includes 32 threads (one for each shader in the SIMD), either compute or graphics, and is sent to the Dual Compute Units for execution. Since each CU has two SIMDs, it can handle two waves, while a Dual Compute Unit can process four.
In NVIDIA’s case, the Gigathread Engine manages thread scheduling with the help of the Warp Schedulers. Each collection of 32 threads is called a warp. As there are four warp schedulers in every SM, each with its own INT32 and FP32 core clusters, each Streaming Multiprocessor can handle four 32-thread warps.
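The wave and warp arithmetic above can be sketched in a few lines. The group size of 32 and the two/four groups per CU, DCU and SM come from the text; the helper names, the 1,000-thread workload and the idea of counting "issue rounds" are assumptions made for illustration, since real hardware keeps many more waves or warps resident and switches between them.

```python
import math

GROUP_SIZE = 32  # threads per wave (Navi) or warp (Turing), as stated in the text


def groups_needed(threads: int, group_size: int = GROUP_SIZE) -> int:
    """Number of waves/warps needed to cover `threads` threads."""
    return math.ceil(threads / group_size)


def issue_rounds(threads: int, groups_per_unit: int) -> int:
    """Rounds needed if one unit works on `groups_per_unit` groups at a time.

    This is a simplification: it treats the unit as processing a fixed
    batch of groups per round and ignores latency hiding.
    """
    return math.ceil(groups_needed(threads) / groups_per_unit)


threads = 1000  # hypothetical workload size

print("waves/warps needed:", groups_needed(threads))  # 32, the last one only partly filled
print("rounds on one CU (2 waves):", issue_rounds(threads, 2))  # 16
print("rounds on one DCU or SM (4 groups):", issue_rounds(threads, 4))  # 8
```

Note that 1,000 is not a multiple of 32, so the last wave or warp carries only 8 active threads while still occupying a full slot.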
Green vs Red: Cache Hierarchy
With the new RDNA-based Navi design, AMD has been rather generous with the cache memory. By adding another block of L1 cache between L2 and L0, AMD has significantly improved latency over GCN. The L0 cache is exclusive to a Dual Compute Unit while the L1 cache is shared between four DCUs. A larger block of 4MB L2 cache is accessible globally to each CU.
NVIDIA’s Turing L2 cache is notably larger than Navi’s, but then again there is no intermediate cache level between it and the per-SM L1. The L1 cache is reconfigurable as per the workload, and there is one block of 96KB L1 cache per SM. The L2 cache is common across all SMs.
Rasterizers, Tessellators and Texture Units
Other than the Execution Units, Cache and the Graphics Engines, there are a few other components such as the Rasterizers, Tessellators, Geometry Processor, Texture Units and Render Backend. These components perform the final steps of the graphics pipeline such as depth effects, texture mapping, tessellation, and rasterization.
Each Compute Unit in the Navi GPUs (and Turing SM for NVIDIA) contains four TMUs. There are two rasterizers per shader engine for AMD and one for every GPC (Graphics Processing Cluster) in the case of the Turing GPU block. In AMD’s Navi, there are also RBs (Render Backends) that handle pixel and color blending, among other post-processing effects.
With Turing, NVIDIA handed the responsibilities of individual shader stages such as the vertex, hull, and tessellation shaders over to the new mesh shader. This allows for fewer CPU draw calls per scene and a higher polygon count. AMD, on the other hand, has doubled down on the existing geometry pipeline by adding a geometry processor that culls unnecessary tessellation and other geometry.
Process Nodes and Conclusion
There is another major difference between the NVIDIA Turing and AMD Navi GPU architectures, and that is with respect to the process node. While NVIDIA’s Turing TU102 die is much bigger than Navi 10, the number of transistors per mm² is higher for the latter.
This is because AMD’s Navi architecture leverages the newer 7nm node from TSMC. NVIDIA, on the other hand, is still using the older 12nm process. Despite that, NVIDIA GPUs are more energy-efficient than the competing Radeon RX 5700 series graphics cards.
Thanks to the 7nm node, AMD has significantly reduced the gap, but NVIDIA’s remaining lead is still a testament to how efficient its GPU architecture really is.
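The density claim is easy to check with some quick arithmetic. The transistor counts and die sizes below are not from this post; they are the commonly published figures for the two chips (about 18.6 billion transistors on 754 mm² for TU102 and about 10.3 billion on 251 mm² for Navi 10), and the function name is just an illustrative helper.

```python
def density_mtr_per_mm2(transistors: float, die_area_mm2: float) -> float:
    """Transistor density in millions of transistors per square millimetre."""
    return transistors / die_area_mm2 / 1e6


# Commonly published figures, brought in from outside this article.
chips = {
    "TU102 (Turing)": (18.6e9, 754.0),
    "Navi 10 (RDNA)": (10.3e9, 251.0),
}

for name, (transistors, area) in chips.items():
    density = density_mtr_per_mm2(transistors, area)
    print(f"{name}: {density:.1f} million transistors per mm2")

ratio = density_mtr_per_mm2(*chips["Navi 10 (RDNA)"]) / density_mtr_per_mm2(
    *chips["TU102 (Turing)"]
)
print(f"Navi 10 is about {ratio:.2f}x as dense as TU102")
```

So even though TU102 is roughly three times the size of Navi 10, Navi 10 packs noticeably more transistors into each square millimetre.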
Video Encode and Decode
Both the Turing and Navi GPUs feature a specialized engine for video encoding and decoding.
In Navi 10 (RX 5600 & 5700), unlike Vega, the video engine supports VP9 decoding. H.264 streams can be decoded at 600 fps for 1080p and at 150 fps for 4K. It can simultaneously encode at roughly 60% of that speed: 1080p at 360 fps and 4K at 90 fps. 8K decode is available at 24 fps for both HEVC and VP9.
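The 1080p and 4K numbers are really the same throughput expressed at two resolutions, which a quick pixel-rate calculation shows. The frame rates are the ones quoted above; the pixel dimensions (1920x1080, 3840x2160 and 7680x4320) are the usual definitions of 1080p, 4K and 8K and are assumed here, as is the helper name.

```python
# Assumed frame sizes for the resolution names used in the text.
RESOLUTIONS = {
    "1080p": (1920, 1080),
    "4K": (3840, 2160),
    "8K": (7680, 4320),
}


def pixel_rate(resolution: str, fps: int) -> int:
    """Pixels processed per second at the given resolution and frame rate."""
    width, height = RESOLUTIONS[resolution]
    return width * height * fps


decode_1080p = pixel_rate("1080p", 600)
decode_4k = pixel_rate("4K", 150)
encode_1080p = pixel_rate("1080p", 360)
encode_4k = pixel_rate("4K", 90)
decode_8k = pixel_rate("8K", 24)

print("H.264 decode, 1080p @ 600 fps:", decode_1080p)  # 1244160000
print("H.264 decode, 4K @ 150 fps:   ", decode_4k)     # 1244160000
print("encode, 1080p @ 360 fps:      ", encode_1080p)  # 746496000
print("encode, 4K @ 90 fps:          ", encode_4k)     # 746496000
print("8K decode @ 24 fps:           ", decode_8k)     # 796262400
print("encode/decode ratio:", encode_1080p / decode_1080p)
```

A 4K frame has exactly four times the pixels of a 1080p frame, so a quarter of the frame rate gives the same pixel rate: about 1.24 billion pixels per second for decode and about 0.75 billion for encode.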
For streamers, Turing had a big surprise. The Turing video encoder allows 4K streaming while exceeding the quality of the x264 encoder. 8K 30 fps HDR support is another sweet addition. This is an advantage over Navi only in theory, though, as no one streams at 8K.
Two more features that come with Turing are VirtualLink and NVLink SLI. The former combines the different cables needed to connect your GPU to a VR headset into one, while the latter improves SLI performance by leveraging the high bandwidth of the NVLink interface.
VirtualLink supports up to four lanes of High Bit Rate 3 (HBR3) DisplayPort along with the SuperSpeed USB 3 link to the headset for motion tracking. In comparison, USB-C only supports four lanes of HBR3 DisplayPort or two lanes of HBR3 DisplayPort + two lanes SuperSpeed USB 3.
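To put the lane counts in perspective, here is a small sketch of the DisplayPort bandwidth each configuration offers. The lane layouts are the ones described above; the HBR3 rate of 8.1 Gbit/s per lane and the 8b/10b line coding (80% efficiency) are standard DisplayPort figures that are not stated in this post, and the names in the code are illustrative.

```python
HBR3_GBPS_PER_LANE = 8.1      # raw HBR3 line rate per lane (standard DisplayPort figure)
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b line coding used at HBR3


def dp_payload_gbps(lanes: int) -> float:
    """Usable DisplayPort bandwidth in Gbit/s for the given number of HBR3 lanes."""
    return lanes * HBR3_GBPS_PER_LANE * ENCODING_EFFICIENCY


# (DisplayPort lanes, SuperSpeed USB 3 available at the same time?)
configurations = {
    "VirtualLink": (4, True),
    "USB-C, video only": (4, False),
    "USB-C, video + USB 3": (2, True),
}

for name, (lanes, has_usb3) in configurations.items():
    usb = "with" if has_usb3 else "without"
    print(f"{name}: {lanes} lanes, {dp_payload_gbps(lanes):.2f} Gbit/s video, {usb} SuperSpeed USB 3")
```

The point of VirtualLink is visible in the table: a regular USB-C connection has to give up half of its video bandwidth to get SuperSpeed USB 3, while VirtualLink keeps all four lanes and the USB link.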