At its Architecture Day 2021, Intel unveiled the Golden Cove core, the next step in its high-performance CPU architecture. Golden Cove succeeds Willow Cove and will compete against AMD’s Zen 3 and Zen 4-based processors. As such, we’ll be comparing it against these cores and analyzing what has changed compared to its predecessors.
Golden Cove is based on Intel’s 10nm Enhanced SuperFin node (now renamed Intel 7) and will power the upcoming Alder Lake and Sapphire Rapids-SP lineups. It has undergone major changes compared to Willow Cove, most of which can be seen as a direct response to Apple’s competing Firestorm cores. With both Golden Cove and Gracemont, Intel has widened the front-end and back-end, improved the out-of-order (OoO) capabilities, and focused more on power efficiency and real-world performance.
Intel Golden Cove Core Architecture vs Willow Cove
First up, we have the Golden Cove front-end. From the top down, the L1 instruction cache is unchanged at 32KB (the same as Zen 3 and Willow Cove), but the associated Instruction Translation Lookaside Buffer (iTLB) has been upgraded. It has been doubled from 128 to 256 entries for 4K pages, and from 16 to 32 entries for 2M/4M huge pages. The accompanying Branch Target Buffer (BTB) has also been more than doubled, growing from 5K entries on Willow Cove to 12K. In comparison, Zen 3 features 6.5K entries.
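To put the iTLB numbers in perspective, here is a rough sketch of the "reach" of each configuration, meaning the amount of code that can be mapped without a TLB miss. The entry counts are the ones quoted above; the reach formula (entries times page size) is a standard back-of-envelope metric, and treating the huge-page entries as 2 MiB pages is our assumption.

```python
# Illustrative arithmetic only. Entry counts come from the article;
# assuming 2 MiB for the "2M/4M" huge-page entries is our simplification.
KIB = 1024
MIB = 1024 * KIB

ITLB_ENTRIES = {
    "Willow Cove": {"4K": 128, "2M": 16},
    "Golden Cove": {"4K": 256, "2M": 32},
}

def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory covered when every TLB entry maps a distinct page."""
    return entries * page_size_bytes

def reach_summary(core: str) -> dict:
    """Reach for 4 KiB pages (in KiB) and 2 MiB pages (in MiB)."""
    entries = ITLB_ENTRIES[core]
    return {
        "4K_reach_KiB": tlb_reach_bytes(entries["4K"], 4 * KIB) // KIB,
        "2M_reach_MiB": tlb_reach_bytes(entries["2M"], 2 * MIB) // MIB,
    }

for core in ITLB_ENTRIES:
    print(core, reach_summary(core))
```

Under these assumptions, doubling the entry counts simply doubles the reach, which matters most for workloads with large instruction footprints.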
The reason for the increased focus on branching is rather straightforward. The decoders are much wider with Golden Cove: a 6-wide decoder, the widest of any x86 core (Zen 3 is 4-wide), plus an additional 1:4 complex decoder, two more than Willow Cove. A wider decoder means more power draw and a higher latency penalty. As such, Intel is relying more on the micro-op cache. It has been nearly doubled, from 2.25K entries (on Willow Cove) to 4K entries, putting it on par with Zen 3. According to Intel, the decoder is clock-gated 80% of the time, and the backend is mostly fed from the micro-op cache instead.
To feed the wider decoder, the instruction fetch width has been doubled from 16 bytes on Willow Cove to 32 bytes on Golden Cove, once again putting it on par with Zen 3. The op-cache can now send 8 decoded instructions to the op-queue, much like, yes, Zen 3, compared to 6 on Willow Cove.
To keep up with the wider decoder and the op-cache, the micro-op queue has also been deepened. For single-threaded applications not using Hyper-Threading (SMT), the queue has grown from 70 to 144 entries, since a single thread can use the resources of both threads. With SMT active, the per-thread queue depth has increased from 70 entries in Willow Cove to 72 in Golden Cove.
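The front-end changes are easier to compare as growth factors. The following sketch just divides the Golden Cove figure by the Willow Cove figure for each structure mentioned so far; the numbers are the ones quoted in this article, and the dictionary labels are our own shorthand.

```python
# (Willow Cove, Golden Cove) figures as quoted in the article.
# Labels are our own shorthand, not Intel terminology.
FRONT_END = {
    "BTB entries (K)": (5, 12),
    "micro-op cache entries (K)": (2.25, 4),
    "fetch width (bytes)": (16, 32),
    "op-cache delivery (uops)": (6, 8),
    "uop queue, single thread": (70, 144),
    "uop queue, per thread with SMT": (70, 72),
}

def growth(old: float, new: float) -> float:
    """Generation-over-generation scaling factor."""
    return new / old

def growth_table() -> dict:
    """Scaling factor for each front-end structure, rounded to 2 places."""
    return {name: round(growth(old, new), 2)
            for name, (old, new) in FRONT_END.items()}

for name, factor in growth_table().items():
    print(f"{name}: {factor}x")
```

Seen this way, the BTB and the single-thread micro-op queue grow the most, while the per-thread queue depth with SMT barely moves.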
Golden Cove: Allocation, Scheduling and OoO Execution
Moving to allocation and scheduling, this is where Zen 3 and Golden Cove diverge. Intel has expanded its latest core architecture to six allocation ports and 12 execution ports. Willow Cove had five allocation ports and 10 execution ports, while Zen 3 separates its FP and INT units (see above diagram). Zen 3 can schedule up to eight integer instructions and six FP instructions simultaneously.
On the downside for AMD, its OoO capabilities are much less robust than Intel’s. While Zen 3 is limited to a 256-entry reorder buffer (ROB), Golden Cove gets a 512-entry ROB, a massive increase over Willow Cove’s 352-entry buffer. An increase in ROB size usually comes with a notable increase in die area and power consumption, so it’s rather surprising that Intel was able to grow the ROB without a node shrink.
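One rough way to read the ROB figures is to ask how long the core can keep allocating new work while the oldest instruction is stalled (for example, on a cache miss). The sketch below is a back-of-envelope estimate, not a simulation. It assumes one ROB entry per micro-op and one micro-op allocated per allocation port per cycle; both are our simplifications rather than anything Intel has stated.

```python
# Back-of-envelope sketch, not a simulator.
# Assumptions (ours): one ROB entry per micro-op, and one micro-op is
# allocated per allocation port every cycle while nothing can retire.
CORES = {
    # name: (ROB entries, allocation ports), both as quoted in the article
    "Willow Cove": (352, 5),
    "Golden Cove": (512, 6),
}

def cycles_until_rob_full(rob_entries: int, alloc_width: int) -> float:
    """Cycles of stall the ROB can absorb before allocation must stop."""
    return rob_entries / alloc_width

for name, (rob, width) in CORES.items():
    print(f"{name}: {cycles_until_rob_full(rob, width):.1f} cycles")
```

Because the allocation width grew alongside the ROB, the window measured in cycles grows less than the raw entry count suggests.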
The changes on the backend are less noticeable. On the FP side, we’re looking at two FADD units, which are, again, a first for an x86 core. Both Zen 3 and Willow Cove lack these fast adders, which are supposed to be faster and more power-efficient than regular adders. The FMA units that support FP16 are limited to the server Sapphire Rapids core due to the lack of AVX-512 support on Alder Lake.
Integer execution (as seen in a recently leaked benchmark) is getting more attention. Golden Cove gets an additional port (Port 10) compared to Willow Cove. Furthermore, LEA (load effective address) instructions now execute in a single cycle on all five ports. This puts Golden Cove on par with Zen 3 in terms of integer ALU execution ports, with the added advantage of faster shift/LEA instructions.
Finally, in terms of the AGUs, Golden Cove gets an additional load port. Overall, Golden Cove can perform three 256-bit loads or two 512-bit loads, in addition to two 256-bit stores, per cycle. That’s a fair bit more than Zen 3, which is capable of three loads (2x 256b) or two stores (1x 256b) per cycle. The Sunny Cove core can do 2x 256b loads and 2x 256b stores at the same time thanks to its two dedicated ports and buffer width. It also has much deeper load and store buffers, at 128 and 72 entries, respectively. Skylake, on the other hand, had 72 entries in the load buffer and 56 in the store buffer. Zen 3 is limited to just 44 load and 48 store entries.
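The load/store figures translate directly into peak L1 bandwidth per cycle. The sketch below converts the operation counts and widths quoted above into bytes per cycle. For Zen 3 we use the 256-bit figures given in parentheses, which is our reading of the text, and these are theoretical peaks rather than measured numbers.

```python
# Peak per-cycle figures as quoted in the article: (ops per cycle, width in bits).
# Using the parenthesised 256-bit counts for Zen 3 is our interpretation.
PEAK_OPS = {
    "Golden Cove": {"loads (256b)": (3, 256),
                    "loads (512b)": (2, 512),
                    "stores (256b)": (2, 256)},
    "Sunny Cove":  {"loads (256b)": (2, 256),
                    "stores (256b)": (2, 256)},
    "Zen 3":       {"loads (256b)": (2, 256),
                    "stores (256b)": (1, 256)},
}

def bytes_per_cycle(ops_per_cycle: int, width_bits: int) -> int:
    """Theoretical peak bytes moved per cycle for one kind of access."""
    return ops_per_cycle * width_bits // 8

def peak_bandwidth(core: str) -> dict:
    """Peak bytes per cycle for every access type listed for a core."""
    return {kind: bytes_per_cycle(ops, bits)
            for kind, (ops, bits) in PEAK_OPS[core].items()}

for core in PEAK_OPS:
    print(core, peak_bandwidth(core))
```

On these numbers, Golden Cove peaks at 96 bytes per cycle with 256-bit loads and 128 bytes per cycle with 512-bit loads, against 64 bytes per cycle of loads for the other two cores.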
Conclusion: 19% IPC Gain, But Will it be Enough?
Intel promises an IPC gain of 19% compared to Cypress Cove and Willow Cove. Given the sheer number of changes to the core architecture, this seems rather tame. More importantly, Golden Cove seems barely ahead of Zen 3 in terms of throughput. How it holds up against Zen 4 and, more importantly, Zen 3D is what will matter in the end. Both Ryzen 5000XT/6000 and Milan 3D are expected to launch in the coming months as a response to Sapphire Rapids and Alder Lake.