It’s unclear how sharing resources between two EUs helps threading (it just reminds me of Bulldozer), but I suspect that in Gen11, having just one EU per scheduler left the scheduler under-utilized, so Intel decided to spread the workload of two EUs across a single software-based scheduler.
The end result is once again similar to what AMD got with RDNA. Doubling the SIMD width (four to eight) and reducing the number of simultaneous threads (one per two EUs) should increase utilization, making it easier to saturate GPU resources.
Since the two ALUs have now been merged, FP and integer workloads get equal priority (8-wide). Unlike NVIDIA’s Turing GPUs, though, Gen12 can only run INT or FP in a given execution cycle. However, the separation of the SFUs (Extended Math ALUs) means that they can run in parallel with FP32 and INT32 workloads (at least in theory). With Gen11, special functions would stall the regular SP pipeline, adding latency.
|Intel Gen11 and Gen12 Throughput|
While Gen12’s integer throughput is twice that of Gen11, it is still half the floating-point or vector throughput. This is likely the result of the pipeline prioritizing vector instructions. Strangely, though, only full-precision integer is compromised, while the half-precision FP and INT rates are identical.
Cache and Bandwidth
As mentioned above, the cache also gets a huge uplift. Besides the introduction of the new L1 data/texture cache, the L3 cache is notably larger than on Gen11, going from 3MB to 3.8MB. The wider L3 cache on Gen12 is also faster than its predecessor, with a transfer capacity of 128 bytes per clock.
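To put the 128 bytes per clock figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the L3 runs at the GPU clock and uses 1500MHz purely as an illustrative frequency; neither is confirmed for any specific part, and the function name is made up for this example.

```python
def l3_bandwidth_gb_per_s(bytes_per_clock: int, clock_mhz: float) -> float:
    """Peak transfer rate in GB/s for a cache moving bytes_per_clock each cycle."""
    # bytes/clock * clocks/second = bytes/second, then scale to GB/s (10^9 bytes)
    return bytes_per_clock * clock_mhz * 1e6 / 1e9


# 128 bytes/clock comes from the text; the 1500 MHz clock is an assumption.
print(l3_bandwidth_gb_per_s(128, 1500))
```

This is a theoretical peak; sustained bandwidth in real workloads will be lower.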
Another radical change to the memory system is the addition of a second ring bus connecting the CPU and GPU. This should essentially double the CPU-GPU bandwidth, significantly improving iGPU performance.
Clock Speeds and SuperFin
Before we move on to the media and encode capabilities of Gen12 Xe graphics, I’d like to talk about the SuperFin node, which also makes an appearance, and why it’s important:
As you can see in the above benchmarks, the effective shader or core capabilities of Intel’s Gen11 GPU are rather poor compared to contemporary architectures from AMD and NVIDIA. It is left far behind by NVIDIA’s MX250, which is a mere 384-shader solution. I don’t expect the IPC of Gen12 to be significantly higher than Gen11, so the only thing that can help overcome this hurdle is the operating frequency.
Intel is betting big (or at least marketing a lot) on its SuperFin node design. As per company slides, Gen12 should be at least 50% faster than Gen11, regularly seeing 1500MHz+ boosts in applications. This should help offset, to some extent, the IPC delta between Xe and rival GeForce and Radeon GPUs.
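To see why frequency matters so much when IPC stays flat, here is a minimal sketch of how peak FP32 throughput scales with clock speed. The EU count (64) and the 1000MHz baseline are hypothetical values picked for illustration, not figures from Intel; the 8 lanes per EU and the 1500MHz boost come from the text above, and counting a fused multiply-add as two operations is an assumption.

```python
def peak_fp32_gflops(eus: int, lanes_per_eu: int, clock_mhz: float) -> float:
    """Theoretical peak FP32 throughput in GFLOPS."""
    # Assumption: each lane retires one fused multiply-add (2 FLOPs) per clock.
    return eus * lanes_per_eu * 2 * clock_mhz / 1000.0


# Hypothetical configuration: 64 EUs, 8 FP32 lanes per EU.
base = peak_fp32_gflops(64, 8, 1000)     # assumed baseline clock
boosted = peak_fp32_gflops(64, 8, 1500)  # 1500 MHz boost mentioned in the text

# With identical per-clock throughput, the speedup equals the clock ratio.
print(base, boosted, boosted / base)
```

With per-clock work held constant, a 50% higher clock gives a 50% higher theoretical peak, which is why the node's frequency headroom carries so much of the expected gain.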
Display and Media Engine