AMD Ryzen 5000 “Zen 3” Architectural Deep Dive

AMD announced its Ryzen 5000 processors at the end of last month, promising massive improvements in gaming and single-threaded performance. With the Zen 3 core, the company claims an IPC boost of 19% over Zen 2, which is even higher than the 15% gain that Zen 2 delivered over the original Zen. These gains are the result of a reworked chiplet topology, a wider execution back-end, increased load-store bandwidth, and something AMD calls “zero-bubble” branch prediction.

The Zen 3 CCX: 8 Cores w/ 32MB Shared L3 Cache

The most obvious change with Ryzen 5000 CPUs is a newer, wider CCX layout, featuring eight cores instead of four, all sharing a single 32MB pool of L3 cache. This supposedly improves cache hit rates and cache bandwidth while reducing latency, including core-to-core latency. Each core has access to twice as much last-level cache (32MB instead of the 16MB per CCX on Zen 2), allowing it to keep much more data close at hand. A larger cache generally also means higher latency, but AMD hasn’t said anything about that, so we ran our own tests to check:

Running SiSoft’s cache bandwidth benchmark, you can see that for single-threaded workloads, Zen 3 provides notably higher bandwidth compared to Zen 2.

The multi-core bandwidth test is more interesting. As the amount of data being transferred increases, the delta between Zen 3 and Zen 2 grows, and upon crossing the 4MB threshold, the Ryzen 5 3600X effectively has higher cache bandwidth than the 5600X, by around 200GB/s, which is rather significant.

The latency benchmarks show similar results. In the in-page random access pattern (RAP) test, the Ryzen 5 3600X becomes faster than the 5600X once the data-set size crosses 128KB, with the delta remaining relatively small until 16MB. Beyond that, it grows to as much as 75 clocks. The average cache latency of the 5600X is 57-61ns, while the 3600X is much faster at just 39ns.

The sequential access pattern (SAP) test shows the opposite result, with the 5600X performing slightly better than the 3600X.

The inter-core latency and bandwidth, on the other hand, are a huge step up on the Zen 3 based processors. The Ryzen 5 5600X has a core-to-core latency of just 25-27ns across all the cores in the CCX, with a bandwidth of 30.66GB/s in the multi-core test and 67.64GB/s in the multi-threaded test.

In comparison, the Zen 2 based 3600X has a best-case inter-core latency of 53-54ns, measured between cores in the same CCX. The worst case is twice as high, as the core is forced to communicate across the two CCXs. The bandwidth is also notably lower: 21GB/s in the multi-core test and around 35GB/s in the multi-threaded one.

Zen 2 vs Zen 3 Front-End: Zero Bubble Branch Prediction and Faster Op-Cache

At a high level, there are three primary changes to the Zen 3 front-end. The first is an improved branch predictor. The second is faster sequencing of Op-cache fetches and faster switching of the Op-cache pipes. The third is the zero-bubble prediction. This simply means that the core recovers much faster from a branch misprediction, which would otherwise flush and stall the pipeline (a stall otherwise known as a bubble). The L1 Branch Target Buffer (BTB) has also been doubled from 512 to 1,024 entries, further improving branch prediction.

The BTB primarily stores the program counter (PC) of each branch instruction along with the PC of its target (the next program counter). Basically, the BTB is used to predict whether a branch is conditional or unconditional, as well as its target. With the former, the direction (if/else) is predicted, while in the case of the latter, the branch is simply taken. A larger BTB allows more branches to be recorded, and therefore more branch instructions to be fetched along the predicted path.
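To make the effect of BTB capacity concrete, here is a minimal sketch of a toy, direct-mapped BTB in Python. It is not a model of AMD's actual design: real BTBs are set-associative and work alongside a direction predictor. The class name, the indexing scheme (which assumes 4-byte-aligned branch addresses), and the synthetic loop of 1,000 branches are all assumptions made for illustration. Only the 512 and 1,024 entry counts come from the article.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: remembers the last-seen target of a branch PC."""

    def __init__(self, entries):
        self.entries = entries
        self.table = [None] * entries  # each slot holds (pc, target) or None

    def _index(self, pc):
        # Assumption: branch PCs are 4-byte aligned, so drop the low two bits.
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]  # predicted target
        return None  # BTB miss: no prediction available

    def update(self, pc, target):
        self.table[self._index(pc)] = (pc, target)


def hit_rate(btb, branches, passes=4):
    """Replay the same branch sequence several times and count correct targets."""
    hits = lookups = 0
    for _ in range(passes):
        for pc, target in branches:
            lookups += 1
            if btb.lookup(pc) == target:
                hits += 1
            btb.update(pc, target)
    return hits / lookups


# Hypothetical workload: a loop body containing 1,000 distinct taken branches.
branches = [(0x1000 + 4 * i, 0x8000 + 4 * i) for i in range(1000)]

small = hit_rate(BranchTargetBuffer(512), branches)
large = hit_rate(BranchTargetBuffer(1024), branches)
print(f"512 entries:  {small:.1%} of lookups hit")
print(f"1024 entries: {large:.1%} of lookups hit")
```

In this synthetic case the 1,024-entry table can hold every branch in the loop, so it only misses on the first pass, while the 512-entry table keeps evicting entries before they are reused. Real workloads will show a much smaller gap.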

Overall, the Zen 3 front-end is largely unchanged. The branch prediction has seen some optimizations, but the basic formula is the same. We still have the same TAGE branch predictor, with an L1 BTB of 1,024 entries (2x vs Zen 2) and an L2 BTB of 6.5K entries. Although the branch prediction has been tweaked for better accuracy, the primary performance gains will come in cases of mispredictions, as the recovery time (zero-bubble) is much shorter than on Zen 2.

Zen 3 Back-end: Wider Execution Windows and Improved Load/Store

The back-end has been extensively overhauled. This is where most of that 19% IPC gain comes from. Both the Integer and Floating-Point Execution engines have been expanded and fine-tuned, and the load-store engine has been beefed up as well.

On the Integer side, there have been minor increases on all fronts: the scheduler has gone up from 92 to 96 entries, the physical register file can now hold 192 entries (up from 180), and the re-order buffer has been increased from 224 to 256 entries.
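To put those "minor increases" in perspective, the relative growth of each structure can be worked out directly from the figures above. This is a small arithmetic sketch; the dictionary keys and the helper name are illustrative, not AMD terminology.

```python
# (Zen 2 entries, Zen 3 entries), taken from the figures quoted in the article
sizes = {
    "integer scheduler": (92, 96),
    "physical register file": (180, 192),
    "re-order buffer": (224, 256),
}


def growth_percent(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100


for name, (zen2, zen3) in sizes.items():
    print(f"{name}: {zen2} -> {zen3} entries (+{growth_percent(zen2, zen3):.1f}%)")
```

The re-order buffer sees the largest relative bump of the three, while the scheduler grows by only a few percent.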

Although the number of integer instructions issued per cycle has been increased to 10 (4 ALUs, 3 AGUs, 1 BR and 2 Stores), up from 7, the total number of execution units is the same. There are still three Address Generation Units (AGUs), one shared with a Store Unit, and four integer execution units (ALUs), one shared with a Store Unit and one with a branch unit. In line with the other improvements to branch prediction, the integer execution engine also gets a dedicated branch execution unit (BR).

With Zen 2, the BR ports were shared with the ALUs. Another notable change to the integer execution is the inclusion of wider, but fewer schedulers. According to AMD, this improves the ALU/AGU utilization by balancing the load across the various ports.

The Floating Point Execution has gotten the same treatment. The scheduler is wider, the dispatch bandwidth is higher and the time to execute FMAC instructions has also been reduced from 5 cycles to 4. Like the Integer Execution Engine, the FP Execution has also been granted more ports. There is a dedicated port for F2I instructions that move data from the floating-point registers to the integer registers directly (without using memory), and another shared with a Store Unit. The MUL/MAC ports have also been separated from the ADD units for improved utilization and faster execution. The FP scheduling queue has also grown from 36 to 64.
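The drop in FMAC latency matters most for code where each multiply-accumulate depends on the result of the previous one, such as a running dot-product accumulator. The sketch below uses a deliberately idealized model in which a chain of dependent operations takes the number of operations times the latency, ignoring everything else in the pipeline. The function name and the chain length of 1,000 are assumptions for illustration. Only the 5-cycle and 4-cycle latencies come from the article.

```python
def dependent_chain_cycles(n_ops, latency_cycles):
    """Idealized cycle count for n_ops FMACs that each wait on the previous result."""
    return n_ops * latency_cycles


chain_length = 1000  # hypothetical number of back-to-back dependent FMACs

zen2_cycles = dependent_chain_cycles(chain_length, latency_cycles=5)
zen3_cycles = dependent_chain_cycles(chain_length, latency_cycles=4)
speedup = zen2_cycles / zen3_cycles

print(f"Zen 2: {zen2_cycles} cycles, Zen 3: {zen3_cycles} cycles, speedup {speedup:.2f}x")
```

Under this simplified model a fully latency-bound FMAC chain finishes in 20% fewer cycles. Code with plenty of independent FMACs in flight is limited by throughput instead and would benefit less from the latency reduction alone.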

The number of loads/stores per cycle has also been increased from 2 Loads + 1 Store to 3 Loads or 2 Stores. On the integer side, the Zen 3 core can do 3 loads per cycle, while the FP side is limited to two 256-bit loads per cycle (or one 256-bit store).

Although the load/store bandwidths have effectively doubled, the L1 to L2 cache transfer speeds are unchanged at 32 bytes/cycle x 2. The L1 fetch width is also the same as Zen 2 at 32 bytes.

There are four additional TLB walkers to keep up with the faster load/stores, along with improved prefetching across page boundaries and more accurate prediction of store-to-load forwarding dependencies.

The CCD to cIOD bandwidth is the same as Zen 2, with transfer rates of 32 bytes per cycle per CCX/CCD, although the maximum memory controller frequency has been raised to 2000MHz. Therefore, you can now run your IF and RAM in 1:1 mode up to 4000MT/s, but most chips will only be able to manage 3733 or 3800MT/s. Essentially, the sweet spot has been lifted from 3600MT/s to 3800MT/s, with 4000MT/s being the upper limit.
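The link between these MT/s figures and the 2000MHz clock is simple arithmetic: DDR memory transfers data twice per clock, so the memory clock is half the data rate, and in 1:1 mode the Infinity Fabric clock has to match it. A minimal sketch, with function names chosen purely for illustration:

```python
def memclk_mhz(data_rate_mts):
    """DDR moves data on both clock edges, so the clock is half the transfer rate."""
    return data_rate_mts / 2


def fclk_for_1to1_mhz(data_rate_mts):
    """In 1:1 mode the Infinity Fabric clock equals the memory clock."""
    return memclk_mhz(data_rate_mts)


for rate in (3600, 3733, 3800, 4000):
    print(f"DDR4-{rate}: Infinity Fabric must run at {fclk_for_1to1_mhz(rate):.0f}MHz for 1:1")
```

This is why 4000MT/s is the ceiling for 1:1 operation: it needs the full 2000MHz, while the 3800MT/s sweet spot only asks for 1900MHz.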

Conclusion: Taking the Gaming Crown from Intel

With Zen 2, AMD already had a considerable lead over Intel in most workloads, from content creation to encoding and even the standard LZMA compression algorithms, thanks to its higher thread counts. However, gaming was one area where the Ryzen chips still lagged behind their Intel counterparts by a noteworthy margin. With Zen 3, AMD aims to change that, depriving Intel of the one advantage it still has in consumer workloads. To find out how successful Team Red has been in this endeavor, check out our review of the Ryzen 5 5600X.

Areej Syed

Processors, PC gaming, and the past. I have been writing about computer hardware for over seven years with more than 5000 published articles. Started off during engineering college and haven't stopped since. Mass Effect, Dragon Age, Divinity, Torment, Baldur's Gate and so much more... Contact: areejs12@hardwaretimes.com.