Intel’s 10nm Ice Lake CPUs are the first major update to the company’s Core architecture since 2015’s Skylake. Every lineup since, from Kaby Lake and Coffee Lake to the “new” Comet Lake chips, leverages the 14nm Skylake core. In this post, we compare Intel’s 10nm Ice Lake CPUs against the 14nm Comet Lake parts and AMD’s Ryzen 3000 chips by examining the core architectures powering them: 14nm Skylake, 10nm Sunny Cove, and 7nm Zen 2. You can read about Tiger Lake (Willow Cove) and Gen12 Xe here:
- Intel’s Willow Cove Core (Tiger Lake) is Basically Sunny Cove w/ More Cache: Identical Decode, EUs, and BP
- Intel Gen12 Xe Graphics Architectural Deep Dive: The Bigger, the Better
10nm Sunny Cove vs 7nm Zen 2: Front End and Branch Predictors
Unlike AMD, Intel is rather stingy with details regarding the front end. In the case of the latest Sunny Cove core, the front end is largely similar to Skylake. According to Intel, the branch predictor has been fine-tuned and the load/store latencies are slightly better than the preceding design. The branch predictor and prefetcher are also reportedly larger, but there are no concrete details on what exactly has changed.
The primary change to the front end concerns the cache sizes. The L1 data cache is now 50% larger: a 48KB, 12-way set-associative design, up from the 32KB, 8-way cache in Skylake. The instruction cache is unchanged at 32KB. In comparison, AMD’s Zen 2 core has 32KB L1 instruction and data caches, the same as Skylake.
Sunny Cove’s fetch and decode stages are also unchanged, with the new 10nm core packing the same 5-wide decoder as Skylake. The instruction queue size is the same at 50 entries (2×25), while issue is likewise limited to six micro-ops per cycle.
AMD’s Zen 2 architecture differs greatly from the Core design here. Intel’s cores can read around 64 bytes per cycle from the L2 cache, while the Zen 2 core is limited to 32 bytes per cycle. At the same time, Zen 2’s instruction fetch is twice as wide as Skylake’s and Sunny Cove’s (32B vs 16B per cycle), but its decoder is slightly narrower at four instructions per cycle.
The reason for this is that the two architectures have fundamentally different front ends. The decoder on the Zen 2 core can send four instructions per cycle to the micro-op queue, while the op cache can deliver another set of (up to) four micro-ops.
In the case of Skylake and Sunny Cove, the decoder sends five instructions while the micro-op cache sends up to six to the allocation queue. The MS-ROM can dispatch up to four (more complex) decoded instructions to the allocation queue, but this path is rarely saturated.
On the Intel side, the L2 cache has also been doubled from 256KB 4-way on Skylake to 512KB 8-way on the 10nm Sunny Cove core. This puts it on par with Zen 2’s L2 cache, at least in terms of size.
An interesting point to note here is that in Intel’s CPUs, instruction fusion usually happens in the pre-decode stage, right after fetch. The instruction lengths are determined, and compatible adjacent instructions are fused into a single macro-op for faster execution. Usually one fusion occurs per cycle, combining a logic/arithmetic instruction with the conditional branch that follows it.
AMD, on the other hand, does this at the micro-op cache, mostly with branch instructions, dispatching the fused macro-ops to the micro-op queue.
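The fusion idea described above can be sketched with a toy model. This is purely illustrative: the fusible-instruction set and the single-pass logic below are simplified assumptions, not Intel’s or AMD’s actual fusion rules, which operate on raw x86 bytes and are far more restrictive.

```python
# Toy model of macro-op fusion: a flag-setting ALU op followed immediately
# by a conditional branch is combined into one macro-op, so the fused pair
# occupies a single pipeline slot downstream. Illustrative only.
FUSIBLE_FIRST = {"cmp", "test", "add", "sub", "and", "inc", "dec"}

def fuse(stream):
    """Return the instruction stream with eligible ALU+branch pairs fused."""
    fused = []
    i = 0
    while i < len(stream):
        op = stream[i]
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        # Fuse only with a *conditional* branch (jne, jz, ...), not jmp.
        if op in FUSIBLE_FIRST and nxt and nxt.startswith("j") and nxt != "jmp":
            fused.append(f"{op}+{nxt}")  # one macro-op, one slot
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused

print(fuse(["mov", "cmp", "jne", "add", "jmp"]))
# ['mov', 'cmp+jne', 'add', 'jmp'] -- five instructions now fill four slots
```

Note how the `cmp`/`jne` pair collapses into one slot while the unconditional `jmp` is left alone, which is the effective bandwidth win fusion provides.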
The most important change to Sunny Cove’s front end is the micro-op cache, which has grown from 1.5K entries in Skylake to 2.25K entries. This was a much-needed improvement, as Zen 2 already has a 4K-entry micro-op cache. The larger cache improves hit rates and effectively enlarges the buffering available for branch instructions.
Moving down, we have the allocation queue. Both designs send up to six micro-ops per cycle to the backend for renaming/reordering and execution.
Intel Sunny Cove vs AMD Zen 2: Backend
The re-order/retire buffer has been massively overhauled in Sunny Cove. The new 10nm core has a huge 352-entry reorder buffer, increasing the core’s OoO execution window. Skylake had a 224-entry reorder buffer, and so does Zen 2. Enlarging the ROB significantly increases power draw and die area; the jump to the 10nm node appears to have absorbed that cost.
On the Intel side, you’ve got a unified scheduler, while AMD divides its register renaming and scheduling between the INT and FP execution units; this segmented approach is called a clustered (dual) architecture. Intel’s unified reservation station (URS) has a total of 97 entries, with 180 INT physical registers and 168 FP registers.
Zen 2, on the other hand, has 64 integer scheduler queue entries, another 28 entries for the AGUs, and a 36-entry FP scheduling queue (plus an additional 64-entry non-scheduling queue). The physical integer and vector register files are nearly the same size as Intel’s, at 180 and 160 entries, respectively.
In an OoO design, an instruction’s result is stored in the ROB upon execution; when the instruction commits, the value is moved to the physical register file. The ROB holds the results of instructions after execution but before commit, along with bookkeeping information (instruction type, flags, name of the result register), while the architectural register file stores the latest committed value of every architectural register. Sitting between them, the register alias table (RAT), also called the renaming table, maps logical registers to physical ones, indicating the location of the latest definition of each architectural register.
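The RAT’s job can be sketched in a few lines. This is a minimal teaching model, not any vendor’s implementation: real renamers use finite free lists, checkpointing for branch recovery, and a separate retirement RAT, all of which are omitted here.

```python
from itertools import count

class Renamer:
    """Minimal register-rename sketch: the RAT maps each architectural
    register to the physical register holding its latest definition; every
    new write allocates a fresh physical register, removing false (WAW/WAR)
    dependencies between instructions."""
    def __init__(self):
        self.free = count()   # unbounded free list, a simplification
        self.rat = {}         # architectural reg -> latest physical reg

    def rename(self, dst, srcs):
        psrcs = [self.rat[s] for s in srcs]  # read current mappings
        pdst = next(self.free)               # allocate a fresh physical reg
        self.rat[dst] = pdst                 # update the speculative RAT
        return pdst, psrcs

r = Renamer()
r.rename("rax", [])                            # rax -> p0
r.rename("rbx", [])                            # rbx -> p1
pdst, psrcs = r.rename("rax", ["rax", "rbx"])  # e.g. add rax, rbx
print(pdst, psrcs)  # 2 [0, 1]: new p2 for rax, sources read p0 and p1
```

Because the second write to `rax` gets its own physical register, an older instruction still reading the previous `rax` (p0) is unaffected, which is exactly what lets the ROB reorder execution freely.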
Overall, Zen 2’s dispatch can send six micro-ops per cycle to the integer rename buffer and four to the FP rename buffer. The 224-entry retire queue (which is shared between the two) can retire up to eight (OoO) micro-ops per cycle across the FP and INT units.
On the Intel side, there’s a common rename buffer and scheduler for INT and FP that receives six micro-ops per cycle from the front end. Sunny Cove has 10 execution ports: four feed the ALUs, two handle store data, and the remaining four go to the address generation units (AGUs), two for loads and two for stores. This allows two loads and two stores per clock cycle, doubling the store bandwidth of Skylake and Zen 2. Overall, Sunny Cove can issue micro-ops to ten ports from the scheduler, a 25% increase over Skylake’s eight.
Now, moving on to the execution units. Ice Lake supports native AVX-512 execution (without splitting instructions into micro-ops) on the client platform. Sunny Cove can do one 512-bit FMA (fused multiply-add) or two 256-bit FMAs per cycle. The integer side gains additional units in the form of MUL, MULHi, and iDIV, but the number of INT instructions executed per cycle is still four. The dedicated iDIV unit should significantly reduce integer division time, which usually takes several dozen clock cycles.
The server Ice Lake core, unlike the mobile one, can do one 512-bit FMA plus two 256-bit FMAs per cycle. This capability is disabled on the mobile platform as it draws too much power.
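The FMA configurations above translate directly into peak FP32 throughput, since each FMA lane performs a multiply and an add (two FLOPs) per cycle. The helper below is a back-of-envelope calculation under that standard counting convention, not a benchmark:

```python
def fma_flops_per_cycle(fma_widths_bits, element_bits=32):
    """Peak FLOPs/cycle from a list of FMA unit widths: each lane of
    element_bits does a fused multiply-add, i.e. 2 FLOPs per cycle."""
    return sum(2 * (width // element_bits) for width in fma_widths_bits)

client_512 = fma_flops_per_cycle([512])            # one 512-bit FMA
client_256 = fma_flops_per_cycle([256, 256])       # or two 256-bit FMAs
server     = fma_flops_per_cycle([512, 256, 256])  # server Ice Lake config
print(client_512, client_256, server)  # 32 32 64
```

Note that the client core’s two options are equivalent in peak FP32 throughput (32 FLOPs/cycle), while the server core’s combined configuration doubles that to 64.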
AnandTech has a neat comparison of Skylake and Sunny Cove:
Comparing Zen 2 to the Sunny Cove core, we can see that, like Skylake, it lacks native AVX-512 support. It does, however, execute four 256-bit FP instructions per cycle (2 MUL and 2 ADD) along with four INT operations in parallel. The original Zen lacked native 256-bit AVX execution and had to break such instructions into two micro-ops.
Sunny Cove has much larger load and store buffers than Skylake and Zen 2. Given the enlarged L1 data cache, this was likely necessary to prevent bottlenecks. It has 128 entries in the load buffer and 72 in the store buffer; Skylake, by comparison, has 72 entries in the load buffer and 56 in the store buffer.
Similar to the Skylake core, Zen 2 can do two loads and one store per cycle. Its load and store queues are much smaller, at 44 and 48 entries, respectively, lower than both Skylake and Sunny Cove.
IPC: Skylake < Zen 2 < Sunny Cove
In essence, the introduction of the 10nm Sunny Cove core has resulted in an IPC increase of 18% (on average) with the 10th Gen Ice Lake chips. This has allowed Team Blue to retain its IPC lead over AMD, though against Zen 2 it’s nowhere near as large as it used to be. Furthermore, poor yields mean that the Sunny Cove based Ice Lake chips are limited to quad-core designs. The recent introduction of octa-core Zen 2 mobile processors (Renoir) not only offsets Sunny Cove’s IPC advantage but also leaves Intel far behind in multi-threaded workloads, which are increasingly common in modern applications.
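The core-count argument above can be made concrete with some rough arithmetic. The numbers here are purely illustrative (the 18% uplift is from this article; Zen 2 is simply treated as the Skylake baseline, clocks are assumed equal, and scaling is assumed perfect, none of which holds exactly in practice):

```python
def mt_throughput(rel_ipc, cores):
    """Idealised multi-thread throughput: relative IPC x core count,
    assuming equal clocks and perfect scaling. Illustrative only,
    not a benchmark result."""
    return rel_ipc * cores

ice_lake_u = mt_throughput(1.18, 4)  # Sunny Cove: ~18% IPC uplift, quad-core
renoir     = mt_throughput(1.00, 8)  # Zen 2 treated as baseline, octa-core
print(ice_lake_u, renoir)  # 4.72 8.0
```

Even granting Sunny Cove its full IPC advantage, doubling the core count dominates in this idealised model, which is why Renoir pulls ahead in heavily threaded workloads.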
Diagram credits: WikiChip and Hiroshige Goto.