Intel’s Odyssey seems to be finally reaching its conclusion. At its Architecture Day 2021, the chipmaker unveiled its next-gen GPU and CPU architectures, namely Xe-HPG (“Alchemist”) and Golden Cove/Gracemont. The former will power the upcoming DG2 graphics cards, the company’s first attempt at finding a foothold in the ever-evolving PC gaming market. In this post, we take a look at Xe-HPG and analyze how it has changed compared to Xe-LP.
Personally, I found the previous slice/sub-slice structure of the Xe-LP GPUs rather confusing. With DG2/Alchemist, this has been rectified. Similar to how AMD is planning to ditch the Compute Unit with RDNA 3, Intel has replaced the Execution Unit (EU) with the Xe-Core as its basic building block. Each Xe-Core features sixteen Vector Engines and sixteen Matrix Engines (XMX). Each Vector Engine packs eight ALUs (much like an Xe-LP EU), while each Matrix Engine is much denser with 16 ALUs.
Each Vector Engine can handle 256 bits of data per cycle, while each Matrix Engine manages a rather considerable 1,024 bits, as is the norm for tensor/matrix-multiplication hardware. The Xe-Core is essentially Intel’s version of the SM (NVIDIA) or Compute Unit (AMD), featuring the low-level cache and the load-store unit. Interestingly, the Ray Tracing Unit, Texture Samplers, Geometry Unit, Rasterizer, and Pixel Backend are placed outside the Xe-Core (at least on paper).
Each Xe-Core is paired with its own Texture Sampler (eight units each) and Ray Tracing Unit, while each Pixel Backend (eight units each) is shared between two Xe-Cores. The rest of the fixed-function units, such as the Geometry Unit and Rasterizer, are shared between four Xe-Cores. Overall, the structure is quite similar to Xe-LP. The primary difference is that everything has been beefed up, with additions such as the RT Unit for ray tracing and the XMX units for matrix instructions.
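To make the sharing ratios above concrete, here is a minimal Python sketch that tallies the fixed-function units for a given number of Xe-Cores. The 32 Xe-Core count is the full-die figure quoted later in this article; the class and function names are purely illustrative and not anything Intel exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCounts:
    xe_cores: int
    rt_units: int
    texture_units: int
    pixel_backend_units: int
    geometry_raster_blocks: int


def tally_units(xe_cores: int) -> UnitCounts:
    """Apply the per-core sharing ratios described in the article."""
    if xe_cores % 4 != 0:
        raise ValueError("Xe-Cores are grouped in fours, so the count must be a multiple of 4")
    return UnitCounts(
        xe_cores=xe_cores,
        rt_units=xe_cores,                        # one RT Unit per Xe-Core
        texture_units=xe_cores * 8,               # one sampler block of 8 units per Xe-Core
        pixel_backend_units=(xe_cores // 2) * 8,  # one backend of 8 units per two Xe-Cores
        geometry_raster_blocks=xe_cores // 4,     # geometry/raster shared by four Xe-Cores
    )


full_die = tally_units(32)
print(full_die)
```

For the full 32-core configuration this works out to 32 RT Units, 256 texture units, 128 pixel backend units, and 8 geometry/raster blocks.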
Another interesting point to note here is that Intel’s Ray Tracing Units handle ray traversal, bounding-box intersection, and triangle intersection in hardware. In comparison, AMD’s Ray Accelerators (RAs) only handle the intersection tests, with ray traversal managed by shader code. On paper, this puts Intel’s Xe-HPG ray-tracing hardware closer to NVIDIA’s RT cores than to AMD’s approach, although how it eventually turns out will largely depend on driver support.
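For readers unfamiliar with the terms, the sketch below shows the three stages in software: the traversal loop, the ray/bounding-box test, and the ray/triangle test. It is an illustrative CPU implementation (a slab test plus the Möller-Trumbore triangle test), not a description of how any vendor’s hardware works internally; the `Node` layout and function names are made up for this example. Per the comparison above, Intel’s RT Unit is said to cover all three stages, while on AMD the loop in `trace` would run as shader code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


@dataclass
class Node:
    """Hypothetical BVH node: an axis-aligned box with children and/or triangles."""
    lo: Vec3
    hi: Vec3
    children: List["Node"] = field(default_factory=list)
    triangles: List[Triangle] = field(default_factory=list)


def hit_box(origin: Vec3, direction: Vec3, lo: Vec3, hi: Vec3) -> bool:
    """Bounding-box intersection (slab test)."""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        if abs(direction[axis]) < 1e-12:
            # Ray is parallel to this slab: it must already start inside it.
            if not (lo[axis] <= origin[axis] <= hi[axis]):
                return False
            continue
        t0 = (lo[axis] - origin[axis]) / direction[axis]
        t1 = (hi[axis] - origin[axis]) / direction[axis]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def hit_triangle(origin: Vec3, direction: Vec3, tri: Triangle) -> Optional[float]:
    """Triangle intersection (Möller-Trumbore); returns the hit distance or None."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < 1e-12:
        return None  # ray is parallel to the triangle plane
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > 1e-9 else None


def trace(root: Node, origin: Vec3, direction: Vec3) -> Optional[float]:
    """Ray traversal: walk the tree, pruning subtrees whose box is missed."""
    closest: Optional[float] = None
    stack = [root]
    while stack:
        node = stack.pop()
        if not hit_box(origin, direction, node.lo, node.hi):
            continue
        for tri in node.triangles:
            t = hit_triangle(origin, direction, tri)
            if t is not None and (closest is None or t < closest):
                closest = t
        stack.extend(node.children)
    return closest


# Tiny scene: one triangle in the z = 5 plane inside a single leaf node.
tri = ((-1.0, -1.0, 5.0), (1.0, -1.0, 5.0), (0.0, 1.0, 5.0))
scene = Node(lo=(-1.0, -1.0, 5.0), hi=(1.0, 1.0, 5.0), triangles=[tri])
hit = trace(scene, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
miss = trace(scene, (0.0, 0.0, 0.0), (0.0, 0.0, -1.0))
print(hit, miss)
```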
Adding up the numbers gives some interesting results. Each Xe-Core can issue 128 FP32 operations per cycle (16 Vector Engines × 8 ALUs), for a total of 512 per Render Slice of four cores. With eight Render Slices (32 Xe-Cores), the fully enabled DG2 GPU can process 4,096 FP32 operations per cycle. This means that each Xe-Core has the same per-clock FP32 throughput as an Ampere SM, but the RTX 30-series GPUs feature a lot more SMs on the higher end than the DG2 (68 on the RTX 3080 vs 32 Xe-Cores here). The same applies to the RT cores.
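The arithmetic behind these figures is simple enough to spell out. The snippet below is just a worked calculation using the unit counts from this article plus the RTX 3080’s 68 SMs at 128 FP32 lanes each; the constant names are my own.

```python
VECTOR_ENGINES_PER_XE_CORE = 16
FP32_ALUS_PER_VECTOR_ENGINE = 8
XE_CORES_PER_RENDER_SLICE = 4
XE_CORES_FULL_DIE = 32

fp32_per_xe_core = VECTOR_ENGINES_PER_XE_CORE * FP32_ALUS_PER_VECTOR_ENGINE
fp32_per_render_slice = fp32_per_xe_core * XE_CORES_PER_RENDER_SLICE
fp32_full_die = fp32_per_xe_core * XE_CORES_FULL_DIE

# Ampere comparison point: 128 FP32 lanes per SM, 68 SMs on the RTX 3080.
AMPERE_FP32_PER_SM = 128
RTX_3080_SMS = 68
fp32_rtx_3080 = AMPERE_FP32_PER_SM * RTX_3080_SMS

print(fp32_per_xe_core, fp32_per_render_slice, fp32_full_die, fp32_rtx_3080)
# 128 512 4096 8704
```

Per clock, the RTX 3080 therefore has a little over twice the FP32 lanes of the full DG2 die, before clock speeds are taken into account.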
Interestingly, Intel has squeezed in a lot more XMX units than anyone anticipated. We have 16 per core and a rather considerable 512 on the entire die (16 × 32 Xe-Cores), resulting in a capacity of 1,024 FP16 ops per cycle per core, the same as the Ampere SM (with sparsity). It’s unclear what Intel plans to do with roughly half of the graphics core dedicated to matrix/tensor operations, but you can bet it’s what will differentiate the Xe GPUs from the competition (or make them useless).
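One way to arrive at the per-core figure is to divide each engine’s datapath width by the element size. This is a back-of-the-envelope reading that assumes one operation per lane per clock; it happens to reproduce the per-core number quoted in this article, but it is not Intel’s official accounting, and the helper name is my own.

```python
def lanes(datapath_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one datapath-wide operation."""
    return datapath_bits // element_bits


ENGINES_PER_XE_CORE = 16   # both Vector and Matrix Engines, per the article
XE_CORES_FULL_DIE = 32

fp32_lanes_per_vector_engine = lanes(256, 32)    # matches the 8 ALUs per Vector Engine
fp16_lanes_per_matrix_engine = lanes(1024, 16)
fp16_ops_per_xe_core = ENGINES_PER_XE_CORE * fp16_lanes_per_matrix_engine
matrix_engines_full_die = ENGINES_PER_XE_CORE * XE_CORES_FULL_DIE

print(fp32_lanes_per_vector_engine, fp16_lanes_per_matrix_engine,
      fp16_ops_per_xe_core, matrix_engines_full_die)
# 8 64 1024 512
```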
Compared to Xe-LP (Gen12, Tiger Lake-U), Xe-HPG is promising a 1.5x frequency boost at the same voltage, alongside a 1.5x increase in performance-per-watt. This means that we’re looking at boost frequencies over 1.6GHz on the DG2 lineup. Combining that with the FP32 throughput calculated above, we’re looking at around 15 TFLOPS of single-precision performance (4,096 lanes × 2 FLOPs per FMA at roughly 1.8GHz). This is still quite a bit lower than the 30 TFLOPS the RTX 3080 offers (and, given Intel’s less mature software, likely won’t match GeForce cards rated at 15 TFLOPS either) but is a good start for Raja Koduri’s team.
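As a sanity check on the TFLOPS estimate, the usual peak-throughput formula is lanes × 2 FLOPs per fused multiply-add × clock speed. The sketch below assumes every FP32 lane retires one FMA per clock, which is a theoretical peak rather than a measured figure; the function name and the sample clocks are my own choices.

```python
FP32_LANES_FULL_DIE = 4096  # 32 Xe-Cores x 16 Vector Engines x 8 ALUs
FLOPS_PER_FMA = 2           # one multiply plus one add


def peak_fp32_tflops(clock_ghz: float, lanes: int = FP32_LANES_FULL_DIE) -> float:
    """Theoretical peak single-precision throughput in TFLOPS."""
    return lanes * FLOPS_PER_FMA * clock_ghz * 1e9 / 1e12


def clock_needed_ghz(target_tflops: float, lanes: int = FP32_LANES_FULL_DIE) -> float:
    """Clock speed required to reach a given peak TFLOPS figure."""
    return target_tflops * 1e12 / (lanes * FLOPS_PER_FMA) / 1e9


at_1_6 = peak_fp32_tflops(1.6)
at_1_8 = peak_fp32_tflops(1.8)
needed_for_15 = clock_needed_ghz(15.0)

print(f"{at_1_6:.2f} TFLOPS at 1.6GHz")
print(f"{at_1_8:.2f} TFLOPS at 1.8GHz")
print(f"{needed_for_15:.2f} GHz needed for 15 TFLOPS")
```

At exactly 1.6GHz the peak is about 13.1 TFLOPS, so the 15 TFLOPS figure implies a boost clock of roughly 1.83GHz.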
Much of this efficiency comes from the use of TSMC’s N6 (6nm) process, but that also means that Intel might be supply-constrained like AMD and NVIDIA. That said, the company may have secured a large allocation thanks to its stronger financial position, and since its GPUs are launching next year, it might just avoid the early shortages. Either way, it would seem that Intel’s plans for the graphics card market are finally taking shape. The countdown has started at long last.