Intel and AMD have been the two primary x86 processor companies for decades. Although both design their chips around the same x86 ISA, over the last decade or so their CPUs have taken completely different paths.
AMD started losing ground to Intel in the mid-2000s, and the 2011 Bulldozer chips made things worse. A combination of low IPC and an inefficient design almost drove the company into the ground, and the slump lasted for nearly a decade. The tables started turning in 2017 with the arrival of the Zen microarchitecture.
The new Ryzen processors marked a complete re-imagining of AMD’s approach to CPUs, with a focus on IPC and single-threaded performance and, most notably, a shift to an MCM, or modular chiplet, design. Intel, meanwhile, continues to do things more or less exactly as it has since the arrival of Sandy Bridge in 2011.
It All Started with Zen
First- and second-gen Ryzen played spoiler to Intel’s midrange efforts by offering more cores and threads than parts like the Core i5-7600K. But a combination of hardware-side issues, such as inter-CCX latency, and a lack of Ryzen-optimized games meant that Intel still commanded a significant performance lead in gaming workloads. This all changed with the arrival of the Ryzen 3000 series, based on the 7nm Zen 2 design.
A drastic improvement in IPC meant that AMD was able to not only offer more cores but also match Intel on single-threaded performance. Buying into Skylake refresh-refresh-refresh-refresh wouldn’t necessarily net you better framerates anymore. Intel’s counter so far has been to offer (wait for it) more cores and threads at every price point. A 10th-gen Core i3 now performs better than the 7th-gen Kaby Lake Core i7-7700K, and top-tier Core i9 parts have core counts previously seen only on Xeons. The processor market is changing, and a lot of it has to do with Intel’s and AMD’s divergent approaches to processor design.
AMD and Intel are on fundamentally different paths in their processor design philosophy. Here’s an annoying elementary-school analogy that might help you understand the difference: which one’s more fruit, a watermelon or a kilo of apples? One’s a really big fruit; the other’s, well, a lot of small fruit. Keep that in mind as we take a deep dive in the next section.
Intel Monolithic Processor Design vs AMD Ryzen Chiplets
Intel follows what’s called a monolithic approach to processor design. What this means, essentially, is that all the cores, cache, and I/O resources for a given processor sit physically on one monolithic die. There are some clear advantages to this approach, the most notable being reduced latency. Since everything is on the same physical substrate, cores take much less time to communicate with each other, access the cache, and access system memory, which leads to optimal performance.
All else being equal, the monolithic approach will always net you the best performance. There’s a big drawback, though, in terms of cost and scaling. To see why, we need to take a quick look at the economics of silicon yields. Strap in: things are going to get a little complicated.
Monolithic CPUs Offer Best Performance but Are Expensive and…
When fabricators manufacture CPUs (or any piece of silicon, for that matter), they almost never manage 100 percent yields. Yield refers to the proportion of usable parts made. If you’re on a mature process node like Intel’s 14nm+++, your silicon yields will be in excess of 90 percent, meaning you get a lot of usable CPUs. The inverse, though, is that for every ten CPUs you manufacture, you may have to discard one defective unit. The discarded unit obviously cost money to make, so that cost has to factor into the final selling price.
At low core counts, a monolithic approach works fine. This in large part explains why Intel’s mainstream consumer CPU line has, until recently, topped out at 4 cores. When you increase core count, though, the monolithic approach results in exponentially greater costs. Why is this?
On a monolithic die, every core has to be functional. If you’re fabbing an eight-core chip and 7 out of 8 cores work, you still can’t use it. Remember what we said about yields being in excess of 90 percent? Mathematically, the chance of a defect compounds with every additional core on a monolithic die, to the point that with, say, a 20-core Xeon, Intel has to throw away one or two defective chips for every usable one, since all 20 cores have to be functional. Costs don’t just scale linearly with core count; they scale exponentially because of wastage.
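The compounding effect described above can be sketched with a toy model. The per-core yield figure and die cost below are illustrative assumptions, not Intel’s actual numbers; the point is only the shape of the curve:

```python
# Toy yield model: a monolithic die is usable only if every core is
# defect-free, so whole-die yield decays exponentially with core count.
# The 0.97 per-core yield below is a made-up illustrative figure.

def die_yield(per_core_yield: float, cores: int) -> float:
    """Probability that all `cores` on a monolithic die are functional."""
    return per_core_yield ** cores

def cost_per_good_die(die_cost: float, per_core_yield: float, cores: int) -> float:
    """Effective cost of one sellable die, amortizing the discarded ones."""
    return die_cost / die_yield(per_core_yield, cores)

for cores in (4, 8, 20):
    y = die_yield(0.97, cores)
    print(f"{cores:2d} cores: yield {y:.1%}, cost multiplier {1 / y:.2f}x")
```

With these made-up numbers, a 4-core die loses only a little to wastage, while a 20-core die’s effective cost nearly doubles, which is the exponential scaling the paragraph describes.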
Furthermore, when Intel expands its 14nm capacity, newly commissioned fabs don’t start out with the same yields as existing ones. This has already contributed to Intel’s processor shortages and to the F-series CPUs (parts sold with their defective integrated graphics disabled).
The consequence of all this is that Intel’s approach is price- and performance-competitive at low core counts, but just not tenable at higher core counts unless Intel sells at thin margins or at a loss. It’s arguably cheaper for Intel to manufacture dual-core and quad-core processors than it is for AMD to ship Ryzen 3 SKUs. We’ll get to why that is now.
Chips, Chiplets and Dies
AMD adopts a chiplet-based or MCM (multi-chip module) approach to processor design. It makes sense to think of each Ryzen CPU as multiple discrete processors stuck together with super glue, that glue being Infinity Fabric in AMD parlance.
A Zen 2 CCX is a 4-core, 8-thread block, together with its slice of L3 cache. Two CCXs sit on one die, the CCD, which is the chiplet that serves as the fundamental building block of Zen 2-based Ryzen and Epyc CPUs. Up to eight CCDs can be placed on a single MCM (multi-chip module), allowing for up to 64 cores in high-end desktop parts such as the Threadripper 3990X.
There are two big advantages to this approach. For starters, costs scale more or less linearly with core count. Because the largest block that has to come out fully functional is a single 4-core CCX rather than a huge monolithic die, AMD doesn’t have to throw out massive stocks of defective CPUs. The second advantage is the ability to leverage those partially defective dies: whereas Intel throws them out, AMD disables cores on a per-CCX basis to create SKUs with different core counts.
For example, both the Ryzen 7 3700X and the Ryzen 5 3600 feature a single CCD (two CCXs) with eight physical cores. The 3600 has one core on each CCX disabled, giving it six functional cores instead of eight. Naturally, this allows AMD to sell six-core parts at more competitive prices than Intel can.
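The salvaging logic above can be sketched as a simple binning rule. The SKU names and thresholds here are illustrative, loosely modeled on the 3700X/3600 example rather than AMD’s actual binning criteria:

```python
# Illustrative binning sketch: a Zen 2 CCD holds two 4-core CCXs.
# A die with all 8 cores good can ship as an 8-core part; a die with
# at least 3 good cores in each CCX can ship as a 6-core part with
# one core per CCX disabled. Thresholds are hypothetical.

def bin_ccd(good_cores_per_ccx: tuple) -> str:
    """Decide which SKU a CCD can be sold as, given good cores per CCX."""
    if all(n == 4 for n in good_cores_per_ccx):
        return "8-core"   # fully functional die (3700X-class)
    if all(n >= 3 for n in good_cores_per_ccx):
        return "6-core"   # disable one core per CCX (3600-class)
    return "scrap"        # too many defects to salvage at these bins

print(bin_ccd((4, 4)))  # 8-core
print(bin_ccd((4, 3)))  # 6-core
print(bin_ccd((2, 4)))  # scrap
```

The key design point is that a single defect no longer kills the whole die; it just moves the die into a cheaper bin.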
There is one big drawback to the chiplet approach: latency. Each chiplet sits on a separate physical substrate, and signals crossing between chiplets over the Infinity Fabric incur a latency penalty. This was most noticeable with first-gen Ryzen: the Infinity Fabric clock was tied to the memory clock, so overclocking your memory resulted in noticeably faster CPU performance.
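That coupling is easy to illustrate. On first-gen Ryzen, the fabric ran at the memory clock, which is half the DDR transfer rate (DDR transfers twice per clock). A toy sketch of the relationship, with illustrative DDR4 speeds:

```python
# First-gen Ryzen locked the Infinity Fabric clock to the memory clock,
# which is half the advertised DDR transfer rate. Faster RAM therefore
# directly raised cross-CCX interconnect speed.

def fabric_clock_mhz(ddr_transfer_rate: int) -> float:
    """Infinity Fabric clock when running 1:1 with the memory clock."""
    return ddr_transfer_rate / 2  # DDR = two transfers per clock

print(fabric_clock_mhz(2400))  # 1200.0 MHz fabric clock
print(fabric_clock_mhz(3200))  # 1600.0 MHz fabric clock
```

Going from DDR4-2400 to DDR4-3200 thus meant a 33 percent faster interconnect, which is why memory overclocking paid off so visibly on early Ryzen.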
AMD managed to rectify this with the Ryzen 3000 CPUs by using what it calls “game cache,” which is really just marketing speak for a gigantic L3 cache. The L3 is the intermediary between system memory and the low-level core caches (L1 and L2). Typical processors have a small amount of L3; Intel’s Core i7-9700K, for instance, has just 12 MB. AMD, however, paired the 3700X with 32 MB of L3 and the 3900X with a whopping 64 MB.
The L3 cache is split evenly between the CCXs. The increased capacity means that, with a bit of intelligent scheduling, cores can cache more of what they need, and that buffer hides most of the latency penalty incurred over the Infinity Fabric. Consequently, Ryzen 3000 delivers equal or better performance than Coffee Lake in almost every workload outside of gaming.
Chiplet or Monolithic: Which is Better?
There is no right or wrong to the approaches Intel and AMD have adopted. However, the chiplet approach is likely something we’re going to see more of in the years to come. This is because Moore’s law, which observed a doubling of transistor counts roughly every two years, has comprehensively slowed down. Individual processor cores aren’t getting twice as fast every two years, so the only route to substantially more performance is to go wide and stack on more cores.
Intel vs AMD: Skylake vs Zen 2
The core microarchitectures of the latest Intel and AMD chips are also quite distinct:
- Right off the bat, the branch predictors are completely different. AMD’s Zen 2 core uses a Hashed Perceptron predictor for first-level predictions and the newer TAGE predictor at the second level. Intel’s Sunny Cove and Skylake architectures both pair a standard first-level predictor with a larger second-level one. Unfortunately, unlike AMD, Intel keeps the finer details of its prefetchers and predictors secret, so we don’t know much about them. Regardless, the two companies have very different approaches to branch prediction.
- The instruction fetch for Zen 2 is slightly wider at 32 bytes, while Skylake and Sunny Cove use a narrower 16-byte fetch. The L1 instruction cache in all three designs is pegged at 32 KB, but Sunny Cove upgrades the L1 data cache by 50% to 48 KB (12-way), while Zen 2 and Skylake are limited to 32 KB (8-way).
- As far as the decoders are concerned, Intel’s design has a five-way primary decoder while the Zen 2 core has a four-way decoder. Intel’s front-end dispatches up to six micro-ops to the back-end, where a common reorder buffer serves both integer and floating-point operations.
- The Zen 2 core, on the other hand, dispatches up to six micro-ops to the integer cluster and up to another four to the floating-point cluster. This is called a clustered architecture: the INT and FP execution units are separated early on.
- The rename and retire buffers also differ. Intel’s architecture uses a common rename/retirement buffer, while AMD’s Zen microarchitecture has separate rename queues for INT and FP operations, with a 224-entry retire queue shared between the two clusters.
- One core difference in the execution units is that Intel’s latest Sunny Cove core has native support for AVX-512 instructions, while Zen 2 and Skylake are limited to 256-bit AVX2.
- Sunny Cove can do one 512-bit FMA (fused multiply-add) or two 256-bit FMAs per cycle, while Zen 2 supports four 256-bit FP instructions per cycle (two MULs and two ADDs) along with four integer operations in parallel. Sunny Cove also supports four integer instructions per cycle.
- Sunny Cove has much larger load and store buffers than Skylake and Zen 2, with 128 entries in the load buffer and 72 in the store buffer. This allows it to perform two loads and two stores per cycle.
- Skylake, on the other hand, has 72 entries in the load buffer and 56 in the store buffer. Like Skylake, Zen 2 can do two loads and one store per cycle.
- Similarly, the LSU to L1 data cache bandwidth for Sunny Cove is higher at 64 bytes/cycle while Zen 2 is pegged at 32 bytes/cycle.
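The FP throughput figures above are easy to sanity-check with a back-of-the-envelope calculation. The function below is a sketch derived from the widths described in this article (an FMA counts as two floating-point operations), not from official peak-throughput specs:

```python
# Back-of-the-envelope peak FP32 throughput per core per cycle, based on
# the vector widths and FMA pipe counts described above. Illustrative only.

def peak_fp32_flops_per_cycle(vector_bits: int, fma_pipes: int) -> int:
    lanes = vector_bits // 32      # FP32 lanes per vector register
    return lanes * fma_pipes * 2   # each FMA = one multiply + one add

# Sunny Cove, one 512-bit FMA per cycle:
print(peak_fp32_flops_per_cycle(512, 1))  # 32
# Zen 2 / Skylake, two 256-bit FMA-capable pipes per cycle:
print(peak_fp32_flops_per_cycle(256, 2))  # 32
```

Interestingly, under these assumptions one 512-bit FMA and two 256-bit FMAs come out to the same peak per-cycle FP32 throughput; AVX-512’s advantage lies elsewhere (wider registers, masking, and new instructions) rather than in this raw figure.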
Both AMD and Intel are planning to launch their next-gen CPU core microarchitectures later this year in the form of Zen 3 and Willow Cove, respectively. Intel will use the third iteration of its 10nm node for the Tiger Lake mobile chips, while the desktop counterpart, Rocket Lake (based on Sunny/Willow Cove), will be another 14nm lineup. AMD, on the other hand, will leverage TSMC’s 7nm+ process for the Ryzen 4000 desktop and Epyc Milan processors. Intel’s recently launched 10th-gen desktop lineup is yet another refresh of 2015’s Skylake core, the fifth consecutive Intel Core generation to use the same microarchitecture. Rocket Lake, expected to launch later this year, will be the first desktop product line in over four years with a new core design.
Diagram Credits go to WikiChip and Hiroshige Goto