It turns out that Apple’s M1 single-threaded benchmarks are even more misleading than we originally anticipated. According to a source, the single-threaded tests that have been thrown around over the last month or so, most notably Cinebench, don’t fully utilize x86 CPU cores in their single-core mode.
It is worth noting that SMT philosophy is embedded in the design. The decode to uOP, and subsequent optimizations for scheduling through retirement (including intermediate issues such as instruction dependencies, pipeline bubbles and flushing, etc.), are a large part of why x86 embraced SMT. RISC load/store architectures simply have less front-end decoding complexity, versus decoupled CISC, and thus are able to obtain better instructions per thread, per clock. This is why dispatching multiple threads is required to maximize the performance of a single core (in x86).
- A friendly architect who wishes not to be named.
Most benchmarks, including Cinebench, have two modes: a single-core test and a multi-core test. The single-core test measures the performance of one core, while the multi-core test measures all the cores and threads in action. However, the former runs only one thread, leaving the second hardware thread provided by SMT idle. SMT is primarily used to improve CPU core utilization in case of a pipeline stall, bubble, or flush, which are the drawbacks of an out-of-order processor that relies on branch prediction. Long story short, it can significantly boost performance.
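To make the utilization argument concrete, here is a minimal toy model in Python. It is only a sketch under loud assumptions: the core issues at most one operation per cycle, each thread is a made-up list of "op" and "stall" tokens, and the function name `run_core` is hypothetical. It does not model any real Intel, AMD, or Apple core.

```python
def run_core(threads, issue_width=1):
    """Toy model: count cycles and issued ops for one core.

    threads: list of token lists, each token is "op" or "stall" (hypothetical encoding).
    Each cycle, a stalled thread burns one stall token, and a ready thread
    issues one op if an issue slot is still free.
    """
    pos = [0] * len(threads)          # next token index for each thread
    cycles = 0
    issued = 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        cycles += 1
        slots = issue_width
        for i, tokens in enumerate(threads):
            if pos[i] >= len(tokens):
                continue              # this thread has finished
            if tokens[pos[i]] == "stall":
                pos[i] += 1           # the stall cycle elapses in the background
            elif slots > 0:
                pos[i] += 1           # issue one op from this thread
                slots -= 1
                issued += 1
    return cycles, issued


# A made-up workload: every op is followed by two stall cycles.
workload = ["op", "stall", "stall", "op", "stall", "stall"]

single = run_core([workload])             # one thread on the core
smt = run_core([workload, workload])      # two threads sharing the core

print("single thread: cycles=%d ops=%d" % single)
print("two threads:   cycles=%d ops=%d" % smt)
```

In this toy workload, a lone thread keeps the issue slot busy for 2 of 6 cycles, while two threads sharing the core issue 4 ops in 7 cycles. The real-world gain depends entirely on how often the actual pipeline stalls, so the numbers here illustrate the mechanism and nothing more.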
If you are aware of the difference between RISC (Arm) and CISC (Intel and AMD), you’d know that the former uses fixed-length instructions, resulting in a simpler front-end, while the latter uses variable-length instructions, which is why the branch predictor is so important and why the front-end is often the cause of the ILP wall. It’s also one of the reasons why we’re still basically limited to four-way x86 decoders nearly a decade after they were introduced. With a wider decoder, the back-end would be left underutilized as the front-end struggled to decode the instructions without having an ETA. While branch prediction does help mitigate this to some extent, there are some caveats, such as the distance to the instruction cache, which make it less than ideal.
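The decoding difference can be sketched with a small Python example. This is a toy illustration, not real Arm or x86 encoding: the fixed-length case assumes 4-byte instructions, and the variable-length case assumes a made-up format in which the first byte of each instruction holds its total length. Both function names are hypothetical.

```python
def fixed_length_starts(code, width=4):
    """Instruction start offsets for a fixed-width encoding (assumed 4 bytes).

    Every offset is known up front, so a wide decoder could in principle
    work on many instructions at once.
    """
    return list(range(0, len(code), width))


def variable_length_starts(code):
    """Instruction start offsets for a toy variable-length encoding.

    Assumption: the first byte of each instruction is its total length.
    The start of instruction N is only known after instruction N-1 has
    been examined, so the scan is inherently sequential.
    """
    starts = []
    offset = 0
    while offset < len(code):
        starts.append(offset)
        length = code[offset]
        if length == 0:
            raise ValueError("toy encoding does not allow zero-length instructions")
        offset += length
    return starts


fixed_stream = bytes(12)                                    # three 4-byte instructions
variable_stream = bytes([1, 3, 0, 0, 2, 0, 5, 0, 0, 0, 0])  # lengths 1, 3, 2, 5

print(fixed_length_starts(fixed_stream))       # [0, 4, 8]
print(variable_length_starts(variable_stream)) # [0, 1, 4, 6]
```

The loop in `variable_length_starts` carries a dependency from one instruction to the next, which is a simplified picture of why widening a variable-length decoder is harder than widening a fixed-length one.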
Apple’s M1 core, being an Arm design, doesn’t suffer from these handicaps of x86 processors and therefore doesn’t need SMT to improve utilization. As such, its single-threaded benchmarks represent the maximum potential of a single M1 core, while the x86 processors are left with some performance to spare.
Thanks to WCCFTech’s quick testing, we are able to see how much this affects the actual performance of contemporary Intel and AMD processors. Enabling SMT helps all the higher-end Renoir and Tiger Lake CPUs beat the M1 chip in the Cinebench R23 ST benchmark by a notable margin. The Ryzen 7 4800U sees a performance boost of as much as 25% with SMT turned on, leaving the M1 behind by around 50 points.