Intel’s Ice Lake CPUs are the first major update to the company’s Core architecture since 2015’s Skylake. Every processor lineup since then, from Kaby Lake and Coffee Lake to the “new” Comet Lake chips, is built on the 14nm Skylake core. The 10nm Ice Lake CPUs, powered by the Sunny Cove core, are therefore not only Intel’s first 10nm product stack but also its first post-Skylake core architecture. In this post, we compare Intel’s 10nm Ice Lake CPUs to AMD’s competing Ryzen 3000 processors as well as the older 14nm Skylake architecture.
Intel’s 10th Gen mobile lineup is composed of Comet Lake and Ice Lake CPUs. While the former features the Skylake core, Ice Lake is based on the newer Sunny Cove design. That is what we’ll be looking at here: a side-by-side comparison of the Sunny Cove and Zen 2 cores, as well as the improvements over Skylake.
10nm Sunny Cove vs 7nm Zen 2: Front End and Branch Predictors
Unlike AMD, Intel is rather stingy with details regarding the front end. In the case of the latest Sunny Cove core, the front end is largely similar to Skylake. According to Intel, the branch predictor has been fine-tuned and the load/store latencies are slightly better than the preceding design. The branch predictor and prefetcher are also reportedly larger, but there are no concrete details on what exactly has changed.
The primary change to the front-end is with respect to the cache sizes. The L1 data cache is now 50% larger, with a 12-way set-associative 48KB design, up from the 8-way 32KB cache in Skylake. The instruction cache is unchanged at 32KB. AMD’s Zen 2 core, on the other hand, has a 32KB L1 instruction cache and a 32KB L1 data cache, the same as Skylake.
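To see what the move from 8-way 32KB to 12-way 48KB means for the cache layout, here is a rough sketch that works out the number of sets. It assumes 64-byte cache lines, which the figures above do not state, and the helper name `cache_sets` is purely illustrative.

```python
def cache_sets(size_kib: int, ways: int, line_bytes: int = 64) -> int:
    """Number of sets in a set-associative cache.

    line_bytes=64 is an assumption; the article does not give the line size.
    """
    total_bytes = size_kib * 1024
    bytes_per_set = ways * line_bytes
    if total_bytes % bytes_per_set:
        raise ValueError("cache size must be a multiple of ways * line size")
    return total_bytes // bytes_per_set


# (size in KiB, associativity) for the L1 data caches quoted above
l1_data = {
    "Skylake": (32, 8),
    "Sunny Cove": (48, 12),
}

l1_sets = {core: cache_sets(size, ways) for core, (size, ways) in l1_data.items()}

for core, sets in l1_sets.items():
    print(f"{core}: {sets} sets")
```

Under that assumption both caches come out at 64 sets, so the 50% extra capacity comes from four additional ways per set rather than from more sets.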
The decoder seems mostly unchanged in Sunny Cove with the new 10nm core packing the same old 5-way decoder as Skylake. The instruction queue size is also the same with 50 entries (25×2) while the instruction fetch is also unchanged with six issues per cycle.
AMD’s Zen architecture differs greatly from the Core design here. While the data fetched from the L2 cache is around 64 bytes per cycle for Intel, the Zen 2 core is limited to 32 bytes. Zen 2’s instruction fetch, however, is twice as wide as that of both the Skylake and Sunny Cove designs, while its decoder is slightly narrower at four instructions per cycle. This should allow Zen 2 to fetch longer instructions, but it can decode only up to four of them per cycle.
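The fetch versus decode trade-off can be put into numbers with a quick back-of-the-envelope sketch. It assumes a 32-byte instruction fetch per cycle for Zen 2 and a 16-byte fetch for the Intel cores (inferred from the "twice as large" comparison, with the figures read as bytes), and it ignores the micro-op cache, which bypasses the decoders entirely. The `FrontEnd` class is a made-up helper for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontEnd:
    fetch_bytes_per_cycle: int  # instruction bytes fetched per cycle (assumed)
    decode_width: int           # instructions decoded per cycle

    def bytes_per_decode_slot(self) -> float:
        """Average instruction length the fetch stage can feed to each decode slot."""
        return self.fetch_bytes_per_cycle / self.decode_width


front_ends = {
    "Sunny Cove": FrontEnd(fetch_bytes_per_cycle=16, decode_width=5),
    "Zen 2": FrontEnd(fetch_bytes_per_cycle=32, decode_width=4),
}

budget = {name: fe.bytes_per_decode_slot() for name, fe in front_ends.items()}

for name, value in budget.items():
    print(f"{name}: {value:.1f} fetched bytes per decode slot")
```

With these assumed figures, the Intel front-end has about 3.2 bytes of fetch bandwidth per decode slot against 8 bytes for Zen 2, which is one way to read the claim that Zen 2 copes better with long instructions but tops out at four per cycle.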
On the Intel side, the L2 cache has also been doubled from 256KB 4-way on Skylake to 512KB 8-way on the 10nm Sunny Cove core. This puts it on par with Zen 2’s L2 cache, at least in terms of size. The latencies will still vary.
The most important change to Sunny Cove’s front-end is the micro-op cache. It has grown from 1.5k entries in Skylake to 2.25k entries in Ice Lake (SC). This was a much-needed improvement, as AMD already has a micro-op cache of 4k entries with Zen 2. These larger caches should noticeably improve hit rates.
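For a sense of scale, the micro-op cache figures can be compared directly. The sketch below assumes "k" means 1024 entries, so 1.5k is 1536 and 2.25k is 2304. The names are illustrative only.

```python
# Micro-op cache capacity in entries, assuming "k" = 1024
UOP_CACHE_ENTRIES = {
    "Skylake": 1536,     # 1.5k
    "Sunny Cove": 2304,  # 2.25k
    "Zen 2": 4096,       # 4k
}


def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


growth = percent_increase(UOP_CACHE_ENTRIES["Skylake"], UOP_CACHE_ENTRIES["Sunny Cove"])
zen2_vs_sunny_cove = UOP_CACHE_ENTRIES["Zen 2"] / UOP_CACHE_ENTRIES["Sunny Cove"]

print(f"Sunny Cove over Skylake: +{growth:.0f}%")
print(f"Zen 2 relative to Sunny Cove: {zen2_vs_sunny_cove:.2f}x")
```

That works out to a 50% increase over Skylake, with Zen 2’s micro-op cache still roughly 1.78 times the size of Sunny Cove’s.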
Moving down you have the allocation queue. Both designs send up to six micro-ops to the backend for renaming/reordering and execution.
AMD vs Intel Core Backend: AVX256 vs AVX512
The re-order/retire buffer has been massively overhauled with Sunny Cove. The new 10nm core has a huge 352 entry reorder buffer for micro-op renaming and reallocation (plus retirement). Skylake had a 224 entry reorder buffer, and so does Zen 2.
However, in the case of the latter, the retire queue is separate from the main execution pipeline, and there are individual rename buffers for the integer and FP pipelines. Overall, Zen 2’s dispatch can send 6 micro-ops to the integer rename buffer, four to the FP rename and 8 to the 224 entry independent retire queue from where they are sent to either of the other two. On the Intel side, there’s a common reorder buffer for INT and FP that receives six micro-ops from the front-end.
Ice Lake’s 10nm Sunny Cove has 10 execution ports: four go to the ALUs, two to store data, and the remaining four to the Address Generation Units (AGUs), split into two for loads and two for stores. This allows for two loads and two stores per clock cycle, which doubles the store throughput of Skylake.
Overall, Sunny Cove can send ten micro-ops from the reorder buffer, a 25% increase over Skylake’s eight. Now, moving on to the execution units: Ice Lake supports native AVX-512 execution (without division into micro-ops) on the client platform. Sunny Cove can do one 512-bit FMA (fused multiply-add) or two 256-bit FMAs per cycle. The integer side gets some additional units in the form of MUL, MULHi, and iDIV, but the number of INT instructions executed per cycle is still four. The inclusion of the iDIV unit should significantly reduce integer division time, which usually takes several dozen clock cycles.
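The "one 512-bit FMA or two 256-bit FMAs" figure is easier to compare when turned into peak floating-point operations per cycle. This is a simple sketch that assumes single-precision (32-bit) elements and counts each FMA lane as two operations (one multiply, one add). It says nothing about clock speeds, which may differ between the two modes.

```python
def fma_flops_per_cycle(vector_bits: int, fma_units: int, element_bits: int = 32) -> int:
    """Peak FLOPs per cycle for a given FMA configuration.

    element_bits=32 (single precision) is an assumption for illustration.
    Each FMA lane counts as 2 FLOPs: one multiply plus one add.
    """
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_units


one_512 = fma_flops_per_cycle(vector_bits=512, fma_units=1)
two_256 = fma_flops_per_cycle(vector_bits=256, fma_units=2)

print(f"1 x 512-bit FMA: {one_512} FP32 FLOPs/cycle")
print(f"2 x 256-bit FMA: {two_256} FP32 FLOPs/cycle")
```

Both configurations reach the same peak of 32 single-precision FLOPs per cycle, so on this core the benefit of AVX-512 lies in doing that work with fewer instructions rather than in higher peak throughput.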
AnandTech has a neat comparison of Skylake and Sunny Cove:
Comparing Zen 2 to the Sunny Cove core, we can see that, like Skylake, Zen 2 lacks AVX-512. However, it still supports four 256-bit instructions per cycle (2 MUL and 2 ADD) along with four INT executions in parallel. The original Zen lacked native support for 256-bit AVX and had to rely on breaking the instructions into two micro-ops.
Sunny Cove has much deeper load and store buffers than Skylake and Zen 2, with 128 entries in the load buffer and 72 in the store buffer. Skylake, on the other hand, has 72 entries in the load buffer and 56 in the store buffer. Similar to the older Skylake core, Zen 2 can do two loads and one store per cycle.
Zen 2’s load and store queues are also much smaller, with 44 and 48 entries, respectively. That’s lower than both Skylake and Sunny Cove.
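The load and store buffer figures quoted above can be lined up in a few lines of code. This is only a tally of those two structures, not a measure of overall performance, and the names are illustrative.

```python
# (load entries, store entries) as quoted in the text
LOAD_STORE_ENTRIES = {
    "Skylake": (72, 56),
    "Zen 2": (44, 48),
    "Sunny Cove": (128, 72),
}

# Total memory operations each core can track in these two buffers
totals = {core: loads + stores for core, (loads, stores) in LOAD_STORE_ENTRIES.items()}

# Smallest to largest combined capacity
ranked = sorted(totals, key=totals.get)

for core in ranked:
    loads, stores = LOAD_STORE_ENTRIES[core]
    print(f"{core}: {loads} load + {stores} store = {totals[core]} entries")
```

On combined capacity alone, Zen 2 tracks 92 entries, Skylake 128 and Sunny Cove 200, so Sunny Cove holds more loads by itself than Zen 2 holds loads and stores together.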
Skylake < Zen 2 < Sunny Cove
In essence, these improvements have resulted in an IPC increase of 18% (on average) with the 10th Gen Ice Lake chips. This has allowed Team Blue to retain its IPC lead over AMD, though with Zen 2 the gap is nowhere near as large as it used to be. Furthermore, poor yields mean that the Sunny Cove based Ice Lake chips are limited to quad-core designs. The recent introduction of octa-core Zen 2 processors (Renoir) not only nullifies Sunny Cove’s IPC advantage but also leaves Ice Lake far behind in multi-threaded workloads, which form the majority of modern applications.
Diagram Credits go to WikiChip and Hiroshige Goto