Intel’s Xe-HP GPUs (meant for data centers and servers) will feature multiple asynchronous compute pipelines in each execution unit, which should allow for higher compute throughput, especially in mixed-precision workloads. This was revealed via a Linux patch to the Intel graphics driver. In the patch notes, the engineer (Francisco Jerez) explains that Xe-HP execution units have multiple asynchronous ALU pipelines, unlike Xe-LP (Gen12), where most ALU instructions go through a single in-order pipeline, with extended math being the exception:
The execution units of Xe-HP platforms have multiple asynchronous ALU pipelines instead of (as far as software is concerned) the single in-order pipeline that handled most ALU instructions except for extended math in the original Xe. It’s now the compiler’s responsibility to identify cross-pipeline dependencies and insert synchronization annotations whenever necessary, which are encoded as some additional bits of the SWSB instruction field.
In Gen12 Xe-LP, all floating-point and integer instructions are handled by the same in-order pipeline, with dependencies tracked in software (software scoreboarding, or SWSB). Xe-HP won’t switch to hardware-based scheduling either: the compiler will spot dependencies between the different pipelines and annotate the instructions accordingly, which allows the pipelines to execute asynchronously. Every Execution Unit will now maintain separate floating-point and integer pipelines (in addition to EM), although it looks like the hardware will remain largely the same, with the split handled at the compiler level. While this method isn’t as efficient as traditional hardware-based async compute, it’s still a step up from simple context switching.
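To make the compiler's job more concrete, here is a minimal sketch in C of how cross-pipeline dependencies could be identified. It is not the driver's code: the pipe names, the `inst` and `sync` structures, and the `find_cross_pipe_deps` function are all hypothetical, and the sketch only tracks read-after-write dependencies between different pipelines. For each instruction it reports, per producer pipeline, how many instructions back in that pipeline's in-order stream the producer sits (1 means the most recent one).

```c
#include <stddef.h>

/* Hypothetical pipe identifiers for this sketch; not taken from the driver. */
enum pipe { PIPE_FLOAT, PIPE_INT, PIPE_LONG, PIPE_COUNT };

#define NUM_REGS 128

struct inst {
    enum pipe pipe;   /* pipeline that executes this instruction */
    int dst;          /* destination register, or -1 if none */
    int src0, src1;   /* source registers, or -1 if unused */
};

/* dist[p] == 0: no wait on pipeline p.
 * dist[p] == n: wait for the n-th most recent instruction issued to p. */
struct sync {
    unsigned dist[PIPE_COUNT];
};

void find_cross_pipe_deps(const struct inst *prog, size_t n, struct sync *out)
{
    unsigned issued[PIPE_COUNT] = {0};  /* instructions issued per pipeline */
    int writer_pipe[NUM_REGS];          /* pipeline of the last writer, or -1 */
    unsigned writer_seq[NUM_REGS];      /* writer's index within its pipeline */

    for (int r = 0; r < NUM_REGS; r++) {
        writer_pipe[r] = -1;
        writer_seq[r] = 0;
    }

    for (size_t i = 0; i < n; i++) {
        const int srcs[2] = { prog[i].src0, prog[i].src1 };

        for (int p = 0; p < PIPE_COUNT; p++)
            out[i].dist[p] = 0;

        for (int s = 0; s < 2; s++) {
            int r = srcs[s];
            if (r < 0 || r >= NUM_REGS || writer_pipe[r] < 0)
                continue;

            int p = writer_pipe[r];
            /* Simplification: only cross-pipeline dependencies are reported. */
            if (p == (int)prog[i].pipe)
                continue;

            unsigned d = issued[p] - writer_seq[r];
            /* Each pipeline is in-order, so waiting on the youngest producer
             * (smallest distance) also covers any older one. */
            if (out[i].dist[p] == 0 || d < out[i].dist[p])
                out[i].dist[p] = d;
        }

        /* Record this instruction as the latest writer of its destination. */
        if (prog[i].dst >= 0 && prog[i].dst < NUM_REGS) {
            writer_pipe[prog[i].dst] = (int)prog[i].pipe;
            writer_seq[prog[i].dst] = issued[prog[i].pipe];
        }
        issued[prog[i].pipe]++;
    }
}
```

A real compiler pass would also have to handle write-after-read and write-after-write hazards, out-of-order instructions such as sends, and the limited range of the distance field. The sketch leaves all of that out.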
It’s hard to say whether the same implementation will make its way to the Xe-HPG graphics cards, but previous slides from the company suggest that Xe-HP GPUs will get separate pipes for different instruction types, including FP64 and matrix units. This would explain the two additional asynchronous ALU pipelines mentioned in the patch:
/**
 * TGL+ SWSB RegDist synchronization pipeline.
 *
 * On TGL all instructions that use the RegDist synchronization mechanism are
 * considered to be executed as a single in-order pipeline, therefore only the
 * TGL_PIPE_FLOAT pipeline is applicable.  On XeHP+ platforms there are two
 * additional asynchronous ALU pipelines (which still execute instructions
 * in-order and use the RegDist synchronization mechanism).  TGL_PIPE_NONE
 * doesn't provide any RegDist pipeline synchronization information and allows
 * the hardware to infer the pipeline based on the source types of the
 * instruction.  TGL_PIPE_ALL can be used when synchronization with all ALU
 * pipelines is intended.
 */
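The comment says that with TGL_PIPE_NONE the hardware infers the pipeline from the source types of the instruction. The C sketch below illustrates what such an inference could look like. Only TGL_PIPE_NONE, TGL_PIPE_FLOAT and TGL_PIPE_ALL are named in the comment. The two extra pipe names, the `sketch_type` enum, the `sketch_infer_pipe` function, and above all the type-to-pipeline mapping are assumptions made for illustration.

```c
/* Pipe identifiers. NONE, FLOAT and ALL are named in the patch comment;
 * INT and LONG are assumed names for the two additional XeHP+ pipelines. */
enum tgl_pipe {
    TGL_PIPE_NONE = 0,
    TGL_PIPE_FLOAT,
    TGL_PIPE_INT,   /* assumption: integer pipeline */
    TGL_PIPE_LONG,  /* assumption: pipeline for 64-bit types */
    TGL_PIPE_ALL
};

/* Hypothetical source types, for this sketch only. */
enum sketch_type {
    SKETCH_TYPE_F16,
    SKETCH_TYPE_F32,
    SKETCH_TYPE_F64,
    SKETCH_TYPE_I32,
    SKETCH_TYPE_I64
};

/* Guess which in-order pipeline would execute an instruction with the
 * given source type. The mapping is an assumption, not Intel's spec. */
enum tgl_pipe sketch_infer_pipe(enum sketch_type src_type, int is_xehp)
{
    /* Per the comment: on TGL everything that uses RegDist is treated as
     * one in-order pipeline, so only TGL_PIPE_FLOAT applies. */
    if (!is_xehp)
        return TGL_PIPE_FLOAT;

    switch (src_type) {
    case SKETCH_TYPE_F64:
    case SKETCH_TYPE_I64:
        return TGL_PIPE_LONG;
    case SKETCH_TYPE_I32:
        return TGL_PIPE_INT;
    case SKETCH_TYPE_F16:
    case SKETCH_TYPE_F32:
    default:
        return TGL_PIPE_FLOAT;
    }
}
```

Under these assumptions, a 32-bit integer add would land on the single pipeline on TGL (`sketch_infer_pipe(SKETCH_TYPE_I32, 0)` returns `TGL_PIPE_FLOAT`) and on the integer pipeline on XeHP+.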
AMD, on the other hand, uses separate hardware-level pipelines for compute workloads (4 async compute pipelines for Navi 21 and MI200) while NVIDIA has multiple datapaths to allow execution of different kinds of workloads from the same ALUs.