The launch of the 3rd Gen Ryzen processors on 7th July 2019 was a key moment in AMD's history. For the first time in over a decade, the company was offering single-threaded performance comparable to, and multi-threaded performance considerably higher than, contemporary Intel chips. Thanks to TSMC's cutting-edge 7nm FinFET node, AMD also took the process lead, shipping the most efficient processors on the market.
How did AMD go from being the "manufacturer of cheap, budget CPUs" to the company setting the pace in desktop performance? Let's find out.
The Older Bulldozer and FX Processors
Let's clear up a few things. Zen is good, but it wouldn't have been such a step up for AMD if the older Bulldozer design weren't flawed in so many ways. After the third-gen K10 architecture, AMD's acquisition of ATI Technologies drained the company's funds, preventing it from investing in a clean-sheet CPU design.
Instead of iterating on the existing architecture, the team bet on a narrow, low-IPC, high-clock design, codenamed Bulldozer. While the core counts looked promising on paper, single-threaded performance suffered severely. The result: AMD lost its competitive edge, and the god-awful FX processors were born.
Making this design work required both high core counts and much higher operating clocks, which isn't something AMD was able to pull off, and the rest is history. To give a clearer picture of how disadvantaged the Bulldozer chips were even against their predecessors, here's an example: to match the performance of the older Phenom II processors, Bulldozer needed to clock 30-40% higher on average. That, of course, didn't happen; instead, the result was power-hungry CPUs that ran hot and unstable.
The company started to move in the right direction with Steamroller and Excavator but hit a roadblock soon after. The design limitations of the Bulldozer architecture prevented the engineers from making further improvements without overhauling the layout.
This finally resulted in the brand-new Zen architecture that ditched the bottlenecks of Bulldozer and its successors, and here we are today, with the 3rd Gen Ryzen lineup leveraging the Zen 2 microarchitecture built on the 7nm node.
Intel’s CPUs are Monolithic
Intel follows what's called a monolithic approach to processor design. What this means, essentially, is that all cores, cache, and I/O resources for a given processor sit physically on the same die. There are some clear advantages to this approach, the most notable being reduced latency: since everything is on the same physical substrate, the cores take much less time to communicate with each other, access the cache, and access system memory, which leads to optimal performance.
All else being equal, the monolithic approach will always net you the best performance. There's a big drawback, though: cost scaling.
When CPUs are fabbed (or any piece of silicon, for that matter), you almost never manage 100 percent yields. Yield refers to the proportion of usable parts produced. If you're on a mature process node like Intel's 14nm+++, your silicon yields will be in excess of 90 percent, meaning you get a lot of usable CPUs. The inverse is that for every ten CPUs you manufacture, up to one defective unit has to be discarded. The discarded unit obviously cost money to make, so that cost has to be factored into the final selling price.
At low core counts, a monolithic approach works fine; this in large part explains why Intel's mainstream consumer CPU line, until recently, topped out at 4 cores. As the core count rises, though, the monolithic approach runs into exponentially greater costs. Why is this?
On a monolithic die, every core has to be functional. If you're fabbing an eight-core chip and 7 out of 8 cores work, you still can't use it. Remember what we said about per-die yields being around 90 percent? Mathematically, the chance of hitting a defect compounds with every additional core on a monolithic die, to the point that with, say, a 20-core Xeon, Intel may have to throw away one or two defective dies for every usable one, since all 20 cores must be functional. Costs don't scale linearly with core count; they scale exponentially because of wastage.
The consequence of all this is that Intel’s process is price and performance competitive at low core counts, but just not tenable at higher core counts unless they sell at thin margins or at a loss.
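To make the yield argument concrete, here's a back-of-envelope sketch. The per-core yield figure is an illustrative assumption (real fab defect data is modeled per unit area and is not public), chosen so that an 8-core die lands near the ~90 percent yield mentioned above:

```python
# Back-of-envelope yield model (illustrative assumptions, not real fab data).
# Assume each core independently has a 98.7% chance of being defect-free,
# picked so that an 8-core monolithic die yields roughly 90 percent.

def monolithic_yield(cores, per_core_yield):
    """A monolithic die is usable only if every single core works."""
    return per_core_yield ** cores

def chiplet_yield(per_core_yield, cores_per_chiplet=8):
    """Each small chiplet succeeds or fails on its own; a defect costs
    one chiplet, not the whole processor, so wastage stays roughly flat
    no matter how many chiplets the final package uses."""
    return per_core_yield ** cores_per_chiplet

p = 0.987
for cores in (4, 8, 16, 28):
    print(f"{cores}-core monolithic die yield: {monolithic_yield(cores, p):.1%}")

# A 64-core product built from eight 8-core chiplets only ever needs each
# small chiplet to yield, so cost scales roughly linearly with core count:
print(f"per-chiplet yield (8 cores): {chiplet_yield(p):.1%}")
```

The exponent is the whole story: doubling the core count on one die squares the failure risk, while doubling the chiplet count merely doubles the (cheap) chiplet order.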
AMD Ryzen 3000 and Epyc Rome: The Chiplet Approach (Multi-Chip Module)
AMD adopts a chiplet-based approach to processor design. Think of each Ryzen CPU as multiple small processors stuck together with superglue; the glue is AMD's Infinity Fabric interconnect. The basic unit is the CCX, a 4-core, 8-thread core complex with its own slice of L3 cache. Two CCXs sit together on one chiplet, the CCD, which is the fundamental building block of Zen 2-based Ryzen and Epyc CPUs. Up to 8 CCDs can be placed on a single MCM (multi-chip module), allowing for up to 64 cores in HEDT parts such as the Ryzen Threadripper 3990X.
There are two big advantages to the MCM design. For starters, costs scale more or less linearly with core count: because a defect costs AMD at most a single small CCX rather than an entire monolithic die, the company doesn't have to throw out massive stocks of defective CPUs. The second advantage is the ability to salvage those partially defective chiplets. Whereas Intel just throws them out, AMD disables the defective cores on a per-CCX basis and sells the parts at lower core counts.
However, there's a catch. The Infinity Fabric adds a latency penalty whenever a core communicates with cores or cache on another chiplet. This has a noticeable impact on latency-sensitive applications such as games, one of the reasons Intel chips have tended to be better at gaming than rival Ryzen parts.
AMD offset the latency added by the Infinity Fabric by doubling the L3 cache (branding it GameCache). Keeping more data in the on-die cache reduces the miss rate, which AMD says improves effective memory latency by up to 33ns.
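The mechanism behind that claim is the standard average memory access time (AMAT) formula: effective latency is the cache hit time plus the miss rate times the penalty of going out to DRAM. The numbers below are illustrative assumptions, not AMD's measured figures:

```python
# Average Memory Access Time sketch with hypothetical numbers.
# A bigger L3 does not make DRAM faster; it just makes trips to DRAM rarer.

def amat(hit_ns, miss_rate, penalty_ns):
    """Effective latency = hit time + miss_rate * miss penalty."""
    return hit_ns + miss_rate * penalty_ns

# Assumed figures: 10 ns L3 hit, 70 ns extra penalty on a miss to DRAM.
smaller_l3 = amat(10, 0.50, 70)   # higher miss rate -> 45 ns effective
doubled_l3 = amat(10, 0.30, 70)   # more hits in cache -> 31 ns effective
print(smaller_l3 - doubled_l3)    # 14 ns saved purely from the miss rate
```

Real miss rates depend on the workload, which is why AMD quotes the benefit as "up to" 33ns rather than a fixed gain.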
According to AMD’s official figures, this improved the 1080p gaming performance by 21% on average. Zen 2’s cache hierarchy doubles the L1 load/store bandwidth compared to Zen and many latency-sensitive applications also get a healthy boost.
Central I/O Die and CCDs
The Zen 2 design uses one CCD for the Ryzen 5 and Ryzen 7 processors and two for the Ryzen 9 chips. All of these CPUs, however, pair their CCDs with the same 12nm I/O die, which acts as the central hub for all off-chip communication, including memory, PCIe lanes, etc.
There are a bunch of other factors that make AMD’s Ryzen 3000 CPUs so competitive and efficient. Let’s have a look:
A major chunk of the Zen 2 core's efficiency comes from TSMC's cutting-edge 7nm node; the rest is split between improved IPC (instructions per cycle) and design optimizations.
Load Store and Cache
The L1 instruction cache has undergone a makeover with Zen 2. It has been reduced from 64KB to 32KB, but its associativity has been increased from 4-way to 8-way.
The Store Queue has been widened from 44 to 48 entries, and the L2 Data Translation Lookaside Buffer has grown from 1.5K to 2K entries (and gotten faster), reportedly shaving a cycle off the latency.
The L1D cache datapath has also been doubled, taking reads and writes from 128 bits to 256 bits. Together, these factors result in a 3x increase in load-store bandwidth (from 32 bytes/cycle to 96 bytes/cycle). Prefetch throttling has been streamlined as well, though AMD didn't detail the improvements.
All modern processors include a branch predictor, which essentially tries to "guess" whether a branch will be taken before its outcome is actually known. AMD's Ryzen 3000 chips pair two predictors: a Hashed Perceptron predictor handles fast, first-level predictions, while the new TAGE predictor handles the second level. TAGE's main highlight is that it tags its entries with progressively longer branch histories, which is why the perceptron handles quick, short-history predictions while TAGE takes over for the longer ones.
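To give a feel for how a perceptron-style predictor learns, here's a toy single-branch version. It is a heavily simplified sketch: a real hashed-perceptron design hashes the branch address into a table of weight vectors and uses far longer histories, and TAGE works differently again:

```python
# Toy perceptron branch predictor: learns a weighted vote over the
# recent branch history (+1 = taken, -1 = not taken).
class PerceptronPredictor:
    def __init__(self, history_len=8, threshold=12):
        self.history = [1] * history_len           # recent outcomes
        self.weights = [0] * (history_len + 1)     # weights[0] is the bias
        self.threshold = threshold                 # training confidence bar

    def predict(self):
        y = self.weights[0] + sum(w * h for w, h in
                                  zip(self.weights[1:], self.history))
        return y, y >= 0                           # True => predict "taken"

    def update(self, taken):
        y, predicted = self.predict()
        outcome = 1 if taken else -1
        # Train on a misprediction, or whenever confidence is still low.
        if predicted != taken or abs(y) <= self.threshold:
            self.weights[0] += outcome
            for i, h in enumerate(self.history):
                self.weights[i + 1] += outcome * h
        self.history = self.history[1:] + [outcome]

p = PerceptronPredictor()
# A loop branch taken 7 times then falling through: the correlated
# history lets the perceptron learn the repeating pattern.
for _ in range(100):
    for taken in [True] * 7 + [False]:
        p.update(taken)
```

The appeal of perceptron predictors is that their storage grows linearly with history length, whereas simple table-based schemes grow exponentially.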
2x Micro-Op Cache
AMD has also doubled the micro-op cache in Zen 2, from 2K to 4K entries. In comparison, the 14nm Skylake core has just 1.5K entries, and even the newer 10nm Sunny Cove design is limited to 2.25K.
Micro-operations are decoded instructions awaiting execution in the CPU pipeline; storing them in the micro-op cache lets the core reuse them without re-decoding.
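In software terms, the micro-op cache behaves like a memo table keyed by the fetched instruction bytes: on a hit, the power-hungry x86 decoders are bypassed entirely. A loose analogy (the `decode` function and its inputs are invented for illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=4096)        # Zen 2's micro-op cache holds ~4K entries
def decode(instruction_bytes):
    # Stand-in for the real x86 decoders, which are slow and power-hungry.
    return tuple(f"uop({b:#x})" for b in instruction_bytes)

hot_loop = (0x48, 0x89, 0xd8)   # some fetched instruction bytes
decode(hot_loop)                # first pass: full decode (cache miss)
decode(hot_loop)                # later passes: served from the cache
print(decode.cache_info().hits)  # -> 1
```

This is exactly why a bigger micro-op cache pays off in tight loops: the same few instruction blocks are executed over and over, so decode work is done once and recycled.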
2x Floating Point Bandwidth
The Zen 2 chips boast superior floating-point performance. The floating-point and load-store datapaths have been doubled from 128-bit to 256-bit, which means AVX2 operations can now be executed without being cracked into two 128-bit micro-ops. There are four pipes in the floating-point unit, two for adds and two for multiplies, and the FP multiply latency has been cut from 4 cycles to 3.
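The "cracking" difference is easy to model. Treating a 256-bit AVX2 register as eight 32-bit lanes, a 128-bit machine has to issue two half-width micro-ops where a 256-bit machine issues one (a sketch; the function names are invented for illustration):

```python
# Illustrative model: a 256-bit vector add as 8 x 32-bit lanes.

def add_128(a, b):
    """One 128-bit micro-op: adds 4 x 32-bit lanes."""
    return [x + y for x, y in zip(a, b)]

def avx2_add_on_zen1(a, b):
    # 128-bit datapath: the 256-bit op is cracked into two micro-ops,
    # low half first, then high half.
    return add_128(a[:4], b[:4]) + add_128(a[4:], b[4:])

def avx2_add_on_zen2(a, b):
    # 256-bit datapath: a single micro-op covers all 8 lanes at once.
    return [x + y for x, y in zip(a, b)]

a, b = list(range(8)), list(range(8, 16))
assert avx2_add_on_zen1(a, b) == avx2_add_on_zen2(a, b)
# Same result either way; Zen 2 simply issues half as many micro-ops,
# doubling peak AVX2 throughput.
```

The results are identical; the win is purely in issue slots and scheduler pressure, which is where the doubled floating-point bandwidth comes from.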
Wider and Deeper Integer Execution
The integer execution units (ALUs) are unchanged, but the number of Address Generation Units (AGUs) has been increased to three, resulting in a wider, 7-wide integer pipeline. The bigger improvements, however, come from wider schedulers and larger buffers: each of the four ALU schedulers has grown to 16 entries, the AGU scheduler has been consolidated into a single unit with doubled bandwidth, and the register rename and reorder buffers have been expanded by around 10-15%.
Infinity Fabric 2
The Infinity Fabric 2 featured on the Ryzen 3000 CPUs is also a major step up from the older IF1. The bus width has been doubled from 256-bit to 512-bit, and support for PCIe 4.0 has been added. AMD claims the Zen 2 Infinity Fabric is 27% more power-efficient than its predecessor.
One major change is that the memory clock and the Infinity Fabric clock can now be decoupled. You can run them in the standard coupled mode (1:1), or run the fabric at half the memory clock (2:1). The latter lets memory overclockers chase speeds in the DDR4-5000 range, but it isn't practical in real-life scenarios.
Keeping the DRAM clock high while halving the IF frequency ultimately leads to worse performance. DDR4-3733 in 1:1 mode is the optimal configuration, and AMD recommends keeping the ratio at 1:1 up to around DDR4-3600; beyond that point, it's better to focus on tightening memory timings.
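The arithmetic behind those recommendations is simple: DDR4 transfers twice per memory-controller clock, so "DDR4-3600" means an 1800MHz MCLK, and the fabric (FCLK) either matches it or runs at half. A small helper makes the trade-off visible:

```python
# Effective clocks for DDR4 memory on Zen 2 (simple sketch).

def clocks(ddr_rate, coupled=True):
    """Return (MCLK, FCLK) in MHz for a given DDR4 transfer rate."""
    mclk = ddr_rate / 2                   # DDR = double data rate
    fclk = mclk if coupled else mclk / 2  # 1:1 vs 2:1 mode
    return mclk, fclk

print(clocks(3600))                  # (1800.0, 1800.0) -> 1:1 sweet spot
print(clocks(3733))                  # (1866.5, 1866.5) -> near the 1:1 ceiling
print(clocks(5000, coupled=False))   # (2500.0, 1250.0) -> 2:1, fabric lags
```

In the 2:1 row, the DRAM runs faster but every off-CCD transfer is bottlenecked by a 1250MHz fabric, which is why the headline memory speed ends up hurting real performance.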
Power Efficiency, Faster Boosts, Security Mitigations
The Windows 10 May 2019 update brought many changes to Ryzen boost behavior. Peak boost clocks can now be reached much faster than before (up to 30x faster, per AMD). However, sustaining those frequencies is harder on an air cooler: the boost behavior of the Zen 2 chips is very similar to NVIDIA's GPU Boost, being quite sensitive to temperature and power.
The cores not only boost in an instant; they drop back to their idle states just as quickly. When your PC is idle, the CPU scales down to a low-power state, drawing as little as 10-20W. Paired with the efficient 7nm node, this makes AMD's Ryzen 3000 CPUs considerably more power-efficient than Intel's 9th Gen Coffee Lake parts.
2018 started with the reveal of the x86 speculative-execution vulnerabilities (Spectre and Meltdown). Intel's products were badly affected, while AMD got away with a few scratches and bruises. For team blue, most of the issues have been dealt with via software and firmware updates (at the cost of performance), but AMD's new Ryzen 3000 arsenal offers hardware-level protection against the known vulnerabilities.
Conclusion: MCM Design + 7nm Node = Win
One of the main takeaways from the Ryzen 3000 launch is the adoption of the chiplet or MCM (Multi-Chip Module) approach. It improves scalability, reduces costs, and offers more cores to the average consumer without raising the price tag. Moore's Law may have "slowed down," but looking at AMD's MCM design and packaging strategy, it's clear the company is looking to score big in the coming years. Intel, on the other hand, continues to release revisions of the 14nm Skylake core while reminding everyone that its "real-world" performance is superior.