Intel Gen12 Xe Graphics Architectural Deep Dive: The Bigger, the Better

It’s unclear how sharing the two EUs and their resources helps threading (it just reminds me of Bulldozer), but I suspect that in Gen11, having just one EU per scheduler left the scheduler under-utilized, so Intel decided to spread the workload of two EUs across a single software-based scheduler.

The end result here is once again similar to what AMD got with RDNA. Doubling the SIMD width (four to eight) and reducing the number of simultaneous threads (one per two EUs) should increase utilization, making it easier to saturate GPU resources.

Since the two ALUs have now been merged, FP and integer workloads get equal priority (8-wide). Unlike NVIDIA’s Turing GPUs, though, Gen12 can only run either INT or FP in a given execution cycle. However, the separation of the SFUs (Extended Math ALUs) means that they can run in parallel with FP32 and INT32 workloads (at least in theory). With Gen11, special functions would stall the regular SP pipeline, adding latency.
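To make the difference between a stalling and a separate special-function unit concrete, here is a rough toy model in Python. It is only a sketch: the per-instruction cycle costs (`alu_cost`, `sfu_cost`) and the instruction counts are hypothetical numbers picked for illustration, not Intel figures, and the model assumes fully independent instructions with perfect overlap, which real shaders rarely achieve.

```python
def cycles_shared_pipeline(alu_ops, sfu_ops, alu_cost=1, sfu_cost=4):
    """Gen11-style behaviour as described above: special functions
    stall the regular pipeline, so every cycle cost simply adds up.
    The cost values are hypothetical placeholders."""
    return alu_ops * alu_cost + sfu_ops * sfu_cost


def cycles_separate_sfu(alu_ops, sfu_ops, alu_cost=1, sfu_cost=4):
    """Gen12-style best case: the SFU runs beside the main ALU, so
    the slower of the two pipes decides the total. This assumes no
    dependencies between instructions (an idealisation)."""
    return max(alu_ops * alu_cost, sfu_ops * sfu_cost)


# Hypothetical shader: 100 regular FP/INT ops and 10 special functions.
shared = cycles_shared_pipeline(100, 10)
overlapped = cycles_separate_sfu(100, 10)
print(f"stalling SFU: {shared} cycles, separate SFU: {overlapped} cycles")
```

With these made-up numbers the special functions disappear entirely behind the regular work, which is the "in theory" case. Any dependency between an SFU result and the next ALU instruction would eat into that gain.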

Intel Gen11 and Gen12 Throughput
Format   Gen12   Gen11
FP32     16      16
FP16     32      32
INT32    8       4
INT16    32      16

While the INT32 throughput of Gen12 is twice that of Gen11 (8 versus 4), it is still half the floating-point or vector throughput (16 for FP32). This is likely the result of the pipeline prioritizing vector instructions. Strangely, though, only the full-precision integer rate is compromised; the half-precision FP and INT rates are identical (32 each).
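The ratios discussed here can be checked directly against the throughput table. The snippet below is a small Python sketch that only uses the numbers from that table; the function names are my own and purely illustrative.

```python
# Rates copied from the Gen11 / Gen12 throughput table.
THROUGHPUT = {
    "FP32":  {"Gen12": 16, "Gen11": 16},
    "FP16":  {"Gen12": 32, "Gen11": 32},
    "INT32": {"Gen12": 8,  "Gen11": 4},
    "INT16": {"Gen12": 32, "Gen11": 16},
}


def generational_gain(fmt):
    """Gen12 rate divided by Gen11 rate for one data format."""
    row = THROUGHPUT[fmt]
    return row["Gen12"] / row["Gen11"]


def int_to_fp_ratio(gen, bits):
    """Integer rate relative to the FP rate at the same bit width."""
    return THROUGHPUT[f"INT{bits}"][gen] / THROUGHPUT[f"FP{bits}"][gen]


for fmt in THROUGHPUT:
    print(f"{fmt}: Gen12 is {generational_gain(fmt):.1f}x Gen11")

print("Gen12 INT32 relative to FP32:", int_to_fp_ratio("Gen12", 32))
print("Gen12 INT16 relative to FP16:", int_to_fp_ratio("Gen12", 16))
```

The integer rates double from Gen11 to Gen12 while the FP rates stay flat, and only the 32-bit integer rate sits at half of its floating-point counterpart.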

Cache and Bandwidth

As mentioned above, the cache also gets a huge uplift. Besides the introduction of the new L1 data/texture cache, the L3 cache is notably larger than the one on Gen11, going from 3MB to 3.8MB. The wider L3 cache on Gen12 is also faster than its predecessor, with a transfer capacity of 128 bytes per clock.
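For a sense of scale, the cache figures can be turned into a growth percentage and a rough bandwidth number. This is back-of-the-envelope Python, not a measurement: the 1500MHz clock is borrowed from the boost figure quoted later in this article, and treating the L3 as running at that GPU clock is my assumption.

```python
L3_GEN11_MB = 3.0          # from the article
L3_GEN12_MB = 3.8          # from the article
L3_BYTES_PER_CLOCK = 128   # from the article
ASSUMED_CLOCK_MHZ = 1500   # assumption: L3 runs at the quoted boost clock


def growth_percent(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100


def bandwidth_gb_per_s(bytes_per_clock, clock_mhz):
    """Peak transfer rate in GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_clock * clock_mhz * 1e6 / 1e9


print(f"L3 capacity growth: {growth_percent(L3_GEN11_MB, L3_GEN12_MB):.1f}%")
print(f"Peak L3 bandwidth at {ASSUMED_CLOCK_MHZ}MHz: "
      f"{bandwidth_gb_per_s(L3_BYTES_PER_CLOCK, ASSUMED_CLOCK_MHZ):.0f} GB/s")
```

Under that clock assumption the L3 would peak at roughly 192 GB/s, with capacity up by about 27% over Gen11.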

Another radical change to the memory system is the addition of a second ring bus connecting the CPU and GPU. This should essentially double the CPU-GPU bandwidth, significantly improving iGPU performance.

Clock Speeds and SuperFin

Before we move on to the media and encode capabilities of Gen12 Xe graphics, I’d like to talk about the SuperFin node, which also makes an appearance, and why it’s important:

As you can see in the above benchmarks, the effective shader or core capabilities of Intel’s Gen11 GPU are rather poor compared to contemporary architectures from AMD and NVIDIA. It is left far behind by NVIDIA’s MX250, which is a mere 384-shader solution. I don’t expect the IPC of Gen12 to be significantly higher than Gen11, so the only thing that can help overcome this hurdle is the operating frequency.

Intel is betting big (or at least marketing a lot) on its SuperFin node design. As per company slides, Gen12 should be at least 50% faster than Gen11, regularly seeing 1500MHz+ boosts in applications. This should help offset, to some extent, the IPC delta between Xe and rival GeForce and Radeon GPUs.

Display and Media Engine

Continued on next page…

Areej

Computer Engineering dropout (3 years), writer, journalist, and amateur poet. I started my first technology blog, Techquila while in college to address my hardware passion. Although largely successful, it was a classic example of too many people trying out multiple different things but getting nothing done. Left in late 2019 and been working on Hardware Times ever since.
