NVIDIA Ampere “Traversal Coprocessor” Won’t be a Separate Chip; Likely an On-Die Component

A recently spotted patent suggests that NVIDIA’s Ampere GPUs will feature a coprocessor for BVH (bounding volume hierarchy) acceleration and ray-triangle intersection. Since then, speculation has run rampant, with many reports claiming that the next-gen NVIDIA GPUs will feature two PCBs carrying the main GPU and an accompanying coprocessor for ray-tracing acceleration.

I had my suspicions about these reports, and Twitter user “CyberCat” has mostly confirmed them. As the user states:

It would be really bizarre to move something like raytracing off-die. RT needs vast memory bandwidth and very low latency and it has to occur fast, in sync with the geometry pipeline.

In other words, moving the RT component off-die would add a latency penalty that significantly slows down the rendering pipeline. The RT cores run the BVH traversal and ray-triangle intersection tests in parallel with the main pipeline, and it’s essential for the two to stay in sync. If the RT cores (or the coprocessor) were moved off-die, it would be hard to run the two in parallel, and the ray-tracing hardware would become a bottleneck.
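To make the latency argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a ray walks the BVH one node at a time and that each node visit has to wait for one memory fetch before the next can start. The function name and every number below are hypothetical placeholders chosen for illustration; none of them are measurements of any NVIDIA GPU, and the model ignores the latency hiding a real GPU gets from keeping many rays in flight.

```python
def traversal_time_ns(node_visits, fetch_latency_ns, test_cost_ns):
    """Time for one ray if every node visit waits on one dependent fetch."""
    return node_visits * (fetch_latency_ns + test_cost_ns)


# Hypothetical, illustrative values only (not measured on real hardware).
NODE_VISITS = 30        # BVH nodes touched by a single ray
TEST_COST_NS = 2        # cost of one box or triangle test
ON_DIE_FETCH_NS = 20    # fetch served by a nearby on-die cache
OFF_DIE_FETCH_NS = 200  # fetch that must cross a chip-to-chip link

on_die = traversal_time_ns(NODE_VISITS, ON_DIE_FETCH_NS, TEST_COST_NS)
off_die = traversal_time_ns(NODE_VISITS, OFF_DIE_FETCH_NS, TEST_COST_NS)

print(f"on-die:  {on_die} ns per ray")
print(f"off-die: {off_die} ns per ray")
print(f"slowdown: {off_die / on_die:.1f}x")
```

Because each fetch depends on the result of the previous test, the per-fetch latency is multiplied by the number of nodes visited, which is why a slower link hurts traversal far more than it would hurt a single bulk transfer.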


As you can see in the diagram below, the TTU (tree traversal unit) is part of the GPC (Graphics Processing Cluster), meaning that it’s an on-die component rather than a separate processor.


The above diagram shows which operations are carried out in the SM, the TTU, and the L1C. The initial Bounding Volume Hierarchy creation (PSO) and the ray creation and distribution happen on the SM. The TTU (coprocessor) continuously interacts with the L1 cache, which would be a slow process if the component were off-die. Finally, both the “Top-Level” and the “Bottom-Level” BVH traversal, as well as the ray transformation and ray/triangle intersection testing (basically the entire RT pipeline), have access to the SM L0 cache, which would only be practical if the “coprocessor” is an on-die component.
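For readers unfamiliar with what BVH traversal and ray/triangle intersection testing actually involve, below is a minimal software sketch in Python. It is not NVIDIA’s hardware design or anything taken from the patent. It uses two textbook techniques, the slab test for ray/box checks and the Möller–Trumbore algorithm for ray/triangle checks, and it leaves out the top-level/bottom-level split and the ray transformation step. All names (Node, hits_box, hit_triangle, closest_hit) are made up for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec = Tuple[float, float, float]


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


@dataclass
class Node:
    """One BVH node: a bounding box plus child nodes and/or triangles."""
    lo: Vec
    hi: Vec
    children: List["Node"] = field(default_factory=list)
    triangles: List[Tuple[Vec, Vec, Vec]] = field(default_factory=list)


def hits_box(origin, direction, lo, hi):
    """Slab test: does the ray touch the axis-aligned box at some t >= 0?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if abs(d) < 1e-12:          # ray is parallel to this pair of planes
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def hit_triangle(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore test: return the hit distance t, or None on a miss."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:              # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None


def closest_hit(root, origin, direction):
    """Walk the tree with an explicit stack, skipping boxes the ray misses."""
    best = None
    stack = [root]
    while stack:
        node = stack.pop()
        if not hits_box(origin, direction, node.lo, node.hi):
            continue                # prune this whole subtree
        for tri in node.triangles:
            t = hit_triangle(origin, direction, tri)
            if t is not None and (best is None or t < best):
                best = t
        stack.extend(node.children)
    return best


# Tiny scene: one triangle at z = 5 inside a leaf box, under a root box.
tri = ((0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0))
leaf = Node(lo=(0.0, 0.0, 4.0), hi=(1.0, 1.0, 6.0), triangles=[tri])
root = Node(lo=(0.0, 0.0, 4.0), hi=(1.0, 1.0, 6.0), children=[leaf])

print(closest_hit(root, (0.25, 0.25, 0.0), (0.0, 0.0, 1.0)))  # ray aimed at the triangle
print(closest_hit(root, (5.0, 5.0, 0.0), (0.0, 0.0, 1.0)))    # ray that misses the box
```

Note how every step of the loop needs the node data from the previous step before it can continue. In hardware, that data comes from the caches described above, which is why keeping the traversal unit close to them matters.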


Computer Engineering dropout (3 years), writer, journalist, and amateur poet. I started my first technology blog, Techquila while in college to address my hardware passion. Although largely successful, it was a classic example of too many people trying out multiple different things but getting nothing done. Left in late 2019 and been working on Hardware Times ever since.
