According to a source on Twitter, AMD’s Big Navi GPU will indeed use 80 compute units (5,120 shaders), for a rated FP32 throughput of 20 TFLOPs. However, it turns out that 20 of these CUs (nearly an Xbox Series S GPU’s worth) will be used as dedicated ray-tracing units in games that feature the technology. This would make sense because, unlike NVIDIA’s RTCores, Navi 2x only features dedicated hardware units for ray-box and ray-triangle intersection testing. The BVH traversal will be done by the shaders, and it’s these 20 CUs that will do that job:
In the above image, you can see that the ray-acceleration units and texture samplers share resources, so only one of the two can operate per cycle. Furthermore, these ray-acceleration units can’t do BVH traversal; that will fall to the SIMD32 units, i.e. the stream processors.
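The headline figures can be sanity-checked with some quick arithmetic. The sketch below is only a back-of-the-envelope check: it takes 64 shaders per CU from the quoted 5,120 / 80, and assumes each shader retires one fused multiply-add (2 FLOPs) per clock. The clock it derives is implied by the 20 TFLOP figure, not a confirmed specification, and the function names are made up for illustration.

```python
SHADERS_PER_CU = 64              # 5,120 shaders / 80 CUs, per the rumored figures
FLOPS_PER_SHADER_PER_CLOCK = 2   # assumption: one fused multiply-add per clock


def shaders(cus):
    """Number of stream processors in a given number of CUs."""
    return cus * SHADERS_PER_CU


def fp32_tflops(cus, clock_ghz):
    """Peak FP32 throughput in TFLOPs for a CU count and clock in GHz."""
    return shaders(cus) * FLOPS_PER_SHADER_PER_CLOCK * clock_ghz / 1000.0


def implied_clock_ghz(cus, tflops):
    """Clock (GHz) that a CU count would need to reach a given TFLOP rating."""
    return tflops * 1000.0 / (shaders(cus) * FLOPS_PER_SHADER_PER_CLOCK)


if __name__ == "__main__":
    clock = implied_clock_ghz(80, 20.0)
    print("Shaders in 80 CUs:", shaders(80))
    print("Shaders in the 20 ray-tracing CUs:", shaders(20))
    print("Clock implied by 20 TFLOPs (GHz):", clock)
    print("FP32 TFLOPs left on the other 60 CUs:", fp32_tflops(60, clock))
```

If the rumor holds, the takeaway is that a quarter of the shader array, and with it a quarter of the peak FP32 rate, would be busy with BVH traversal in ray-traced games.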
Big Navi will also have 320 texture samplers and 96 ROPs, roughly the same as the RTX 3080, which packs 272 TMUs and 96 ROPs. The main difference is that on the latter, the ROPs are linked to the GPCs rather than the memory controller.
The GeForce RTX 3080, for its part, features a total of 68 RT cores and 272 Tensor cores, with 5MB of L2 cache and 128KB of L1 per SM. The total number of shaders is 8,704, but performance won’t scale in proportion. We’ll see roughly the same performance as a Turing GPU with around 5,000 shaders, as integer compute can’t run on the same cores as FP32 at the same time. Overall, it looks like the memory system will be Ampere’s primary advantage against Big Navi. It remains to be seen if the higher core clocks of the latter will be enough to counter the additional bandwidth. We’ll likely see larger deficits at 4K, as that’s where the need for more memory arises. At 1440p, the gap should be relatively smaller.
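To put the two spec sheets side by side, here is a small sketch that breaks the quoted totals down per SM and per CU. It assumes one RT core per SM, so the 68 RT cores are taken as the SM count; the dictionary and function names are purely illustrative.

```python
# Figures as quoted above; nothing here is measured.
RTX_3080 = {"rt_cores": 68, "tensor_cores": 272, "shaders": 8704,
            "tmus": 272, "rops": 96, "l1_kb_per_sm": 128}
BIG_NAVI = {"cus": 80, "shaders": 5120, "tmus": 320, "rops": 96}

# Assumption: one RT core per SM, so the RT core count doubles as the SM count.
SM_COUNT = RTX_3080["rt_cores"]


def rtx_3080_per_sm():
    """Per-SM breakdown of the RTX 3080 totals."""
    return {
        "shaders": RTX_3080["shaders"] / SM_COUNT,
        "tensor_cores": RTX_3080["tensor_cores"] / SM_COUNT,
        "tmus": RTX_3080["tmus"] / SM_COUNT,
        "total_l1_kb": RTX_3080["l1_kb_per_sm"] * SM_COUNT,
    }


def big_navi_per_cu():
    """Per-CU breakdown of the rumored Big Navi totals."""
    return {
        "shaders": BIG_NAVI["shaders"] / BIG_NAVI["cus"],
        "tmus": BIG_NAVI["tmus"] / BIG_NAVI["cus"],
    }


def tmu_ratio():
    """Big Navi texture units relative to the RTX 3080."""
    return BIG_NAVI["tmus"] / RTX_3080["tmus"]


if __name__ == "__main__":
    print("RTX 3080 per SM:", rtx_3080_per_sm())
    print("Big Navi per CU:", big_navi_per_cu())
    print("TMU ratio (Big Navi / RTX 3080):", round(tmu_ratio(), 3))
```

Both designs work out to four texture units per SM or CU, so the gap in TMU count comes down to Big Navi simply having more CUs than the 3080 has SMs.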
- NVIDIA RTX 3080/3090 “Ampere” Architectural Deep Dive: 2x FP32, 2nd Gen RT Cores, 3rd Gen Tensor Cores and RTX IO
- Update: A patent from a while ago:
As this patent shows, the BVH traversal is performed by the shaders in these CUs, with the accompanying ray-acceleration units being used for ray-triangle and ray-box intersection tests. The related data is managed using the vGPRs. This is a different approach from the one NVIDIA takes. Each RTCore in an RTX GPU is separate from the main graphics and compute pipeline and can run in parallel with it, and the two don’t share any resources. More on that in the above post.
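To make the division of labor concrete, here is a minimal CPU-side sketch of the idea, not AMD’s actual implementation. The `traverse` loop stands in for the work the shaders would do (walking the tree and keeping the traversal stack, which on the GPU would live in registers), while `ray_box` and `ray_triangle` stand in for the fixed-function intersection tests. All names, the `Node` layout, and the demo scene are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec = Tuple[float, float, float]
Tri = Tuple[Vec, Vec, Vec]


@dataclass
class Node:
    lo: Vec                      # bounding box minimum corner
    hi: Vec                      # bounding box maximum corner
    children: List["Node"] = field(default_factory=list)
    triangles: List[Tri] = field(default_factory=list)


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_box(origin: Vec, direction: Vec, lo: Vec, hi: Vec) -> bool:
    """Slab test: stand-in for the hardware ray-box intersection unit."""
    tmin, tmax = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:   # parallel to this slab and outside it
                return False
            continue
        t1, t2 = (l - o) / d, (h - o) / d
        tmin = max(tmin, min(t1, t2))
        tmax = min(tmax, max(t1, t2))
    return tmin <= tmax


def ray_triangle(origin: Vec, direction: Vec, tri: Tri) -> Optional[float]:
    """Moller-Trumbore test: stand-in for the ray-triangle unit."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < 1e-9:          # ray is parallel to the triangle
        return None
    s = sub(origin, v0)
    u = dot(s, p) / det
    q = cross(s, e1)
    v = dot(direction, q) / det
    if u < 0 or v < 0 or u + v > 1:
        return None
    t = dot(e2, q) / det
    return t if t > 1e-9 else None


def traverse(root: Node, origin: Vec, direction: Vec) -> Optional[float]:
    """Shader-side loop: walks the BVH and calls the intersection tests."""
    closest = None
    stack = [root]               # traversal state kept by the "shader"
    while stack:
        node = stack.pop()
        if not ray_box(origin, direction, node.lo, node.hi):
            continue             # whole subtree is skipped
        for tri in node.triangles:
            t = ray_triangle(origin, direction, tri)
            if t is not None and (closest is None or t < closest):
                closest = t
        stack.extend(node.children)
    return closest


def demo_scene() -> Node:
    """Two triangles facing the origin, at z = 5 and z = 10."""
    def leaf(z):
        tri = ((-1.0, -1.0, z), (1.0, -1.0, z), (0.0, 1.0, z))
        return Node((-1.0, -1.0, z), (1.0, 1.0, z), triangles=[tri])
    return Node((-1.0, -1.0, 5.0), (1.0, 1.0, 10.0),
                children=[leaf(5.0), leaf(10.0)])
```

Calling `traverse(demo_scene(), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))` returns the distance to the nearer triangle, while a ray that misses the root box returns `None` without testing any triangles. In this picture, the loop and its stack are what would occupy the shaders, which is why the traversal competes with regular shading work on Navi 2x.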