Not only was Bulldozer’s L2 cache shared between the two cores of each module, but it was also slower than the L2 on the preceding K10 chips. The L2 on the AMD FX-8150 had more than twice the latency of the one on Intel’s competing Sandy Bridge Core i7 (lower is better).
Although the caches on Bulldozer are larger than the ones on Sandy Bridge and Zen, they are considerably slower. Even at 4.6GHz, the FX chip’s L2 cache is still twice as slow as the 2600K’s. The shared logic and cache architecture forced AMD to use a larger L2 cache, but the company went with cheaper and slower memory. Cache memory is one of the most expensive components of a CPU, and judging by how the Bulldozer parts perform, the company couldn’t afford to spend money on a faster, pricier cache.
Bulldozer’s front-end was another bottleneck. A single 4-way fetch and decode unit was shared between the two cores of a module. AMD used interleaved multi-threading to track the individual threads and give priority to the one with the most work left to complete. Only one thread could be worked on per cycle, meaning that the other sat idle.
Intel’s Sandy Bridge and the newer Zen architecture follow the standard design, with a dedicated front-end for each core. The former fetches six macro-ops per cycle, sending five to the 4-way decoder.
AMD’s Zen architecture has a four-way decoder as well, but it can also fetch up to eight instructions from the op-cache. Bulldozer doesn’t even have a micro-op cache.
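To make the cost of a shared front-end concrete, here is a minimal toy model in Python. It assumes a strict alternation between the two threads, which is a simplification (the real hardware prioritized threads rather than simply taking turns), and the function name and parameters are made up for illustration. It only counts decode slots and ignores everything else in the pipeline.

```python
def decoded_per_thread(cycles, decode_width=4, shared=True):
    """Count decode slots each of two threads receives over `cycles` cycles.

    Toy model only: `decode_width` is the 4-way decode mentioned above, and
    strict round-robin sharing is an assumption made for illustration.
    """
    decoded = [0, 0]
    for cycle in range(cycles):
        if shared:
            # One front-end for both cores: only one thread is served per cycle.
            decoded[cycle % 2] += decode_width
        else:
            # Dedicated front-end per core: both threads are served every cycle.
            decoded[0] += decode_width
            decoded[1] += decode_width
    return decoded


if __name__ == "__main__":
    print("shared front-end:   ", decoded_per_thread(100, shared=True))
    print("dedicated front-end:", decoded_per_thread(100, shared=False))
```

In this sketch each thread gets half as many decode slots when the front-end is shared, which is the effect described above: while one thread is being served, the other waits.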
Windows 7 Scheduler
While all the other bottlenecks are relatively well-known, this last one is often overlooked. Back in the days of Bulldozer, Windows 7 was the de facto Windows OS. The operating system’s resource allocation and scheduling weren’t suited to the Bulldozer architecture. AMD’s octa-core models were basically four modules consisting of two cores each, sharing resources. However, Windows 7 didn’t see it that way. It saw 8 independent cores running simultaneously.
When the scheduler assigned a new thread 1b (dependent on data from 1a) to a different module from the one running 1a, it created a problem. AMD’s Turbo Core technology couldn’t be enabled in this scenario. And Windows 7 did this quite often. Considering that the Bulldozer architecture was built for higher clock speeds to offset the low IPC, this was a major drawback.
AMD’s Turbo Core kicks in only when four cores from two modules are active. However, the situation explained above will often cause four cores from three modules to be in use. Turbo Core can’t be enabled in this scenario. Tests show that disabling two modules actually yields better performance than stock in many applications.
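The placement problem can be sketched in a few lines of Python. The core numbering (cores 2m and 2m+1 belonging to module m) and the helper names are assumptions made for illustration, and the eligibility rule simply encodes the article's description of Turbo Core, not AMD's actual firmware logic.

```python
CORES_PER_MODULE = 2  # each Bulldozer module pairs two cores


def active_modules(core_ids):
    """Return the set of module indices touched by the given core IDs.

    Assumes a hypothetical numbering where cores 2m and 2m+1 sit in module m.
    """
    return {core // CORES_PER_MODULE for core in core_ids}


def turbo_eligible(core_ids, max_modules=2):
    """Toy rule from the text: turbo only if the active cores fit in two modules."""
    return len(active_modules(core_ids)) <= max_modules


if __name__ == "__main__":
    packed = [0, 1, 2, 3]  # four threads packed into modules 0 and 1
    spread = [0, 1, 2, 4]  # a module-unaware placement touching modules 0, 1 and 2
    for name, cores in (("packed", packed), ("spread", spread)):
        print(name, sorted(active_modules(cores)), turbo_eligible(cores))
```

A scheduler that treats all eight cores as interchangeable has no reason to prefer the packed placement, so it can easily land on the spread one and lose the clock boost.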
This is due to the poor IPC and single-threaded performance of the Bulldozer FX processors. Here are the complete diagrams of the Bulldozer, Sandy Bridge, Skylake, and Zen architectures to compare side-by-side:
Sources: Wikichip, AMD, Intel and PC Watch Japan.