NVIDIA Ampere Architectural Analysis: A Look at the A100 Tensor Core GPU

NVIDIA yesterday launched the first chip based on the 7nm Ampere architecture. While not exactly a GPU in the traditional sense, it still features the same basic design that will later be used in the consumer Ampere cards. The Tesla A100, or as NVIDIA calls it, “the A100 Tensor Core GPU”, is an accelerator that speeds up AI and neural network-related workloads. As such, most of the focus has gone into improving the mixed-precision capabilities and the Tensor Cores. The GA100 is the full implementation, while the A100 is a cut-down variant that forms the Ampere-based Tesla GPU.

NVIDIA has disabled one entire GPC (Graphics Processing Cluster) on the A100, bringing the core count down from 8,192 to 6,912. The memory also takes a hit in the form of one full HBM2 stack. While the GA100 packs six memory stacks, the A100 has five, connected to the GPU via ten 512-bit memory controllers.

NVIDIA has basically disabled a whole GPC and as a result the accompanying SMs, cache and memory controllers are lost. The full GA100 block looks something like this:

Here are the specs of the new Ampere based A100 next to its predecessors, the Volta and Pascal powered Tesla accelerators:

| Data Center GPU | NVIDIA Tesla P100 | NVIDIA Tesla V100 | NVIDIA A100 |
| --- | --- | --- | --- |
| GPU Codename | GP100 | GV100 | GA100 |
| GPU Architecture | NVIDIA Pascal | NVIDIA Volta | NVIDIA Ampere |
| GPU Board Form Factor | SXM | SXM2 | SXM4 |
| SMs | 56 | 80 | 108 |
| TPCs | 28 | 40 | 54 |
| FP32 Cores / SM | 64 | 64 | 64 |
| FP32 Cores / GPU | 3584 | 5120 | 6912 |
| FP64 Cores / SM | 32 | 32 | 32 |
| FP64 Cores / GPU | 1792 | 2560 | 3456 |
| INT32 Cores / SM | NA | 64 | 64 |
| INT32 Cores / GPU | NA | 5120 | 6912 |
| Tensor Cores / SM | NA | 8 | 4 |
| Tensor Cores / GPU | NA | 640 | 432 |
| GPU Boost Clock | 1480 MHz | 1530 MHz | 1410 MHz |

Right off the bat, the most noticeable change is the increase in the number of SMs per GPU and the accompanying cores. While the SM configuration hasn’t changed much, the Tensor Cores per SM have been halved compared to Volta. Despite that, the peak mixed-precision compute capability of the A100 is a staggering 20x that of the V100. This is thanks in part to fine-grained structured sparsity, which essentially doubles the throughput.

As you can see in the above figure, Ampere increases throughput by a factor of 20 for INT8 operations using sparsity, while FP16-based workloads see a 5x jump. High-precision FP64 operations are faster by 2.5x. Lastly, there is a new data type supported by Ampere called TF32, or Tensor Float 32. It uses the same 8-bit exponent range as FP32 but provides the precision of FP16 by using a 10-bit mantissa instead of FP32’s 23-bit one. TF32 is up to 20x faster than FP32 and produces a standard IEEE FP32 output.
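To make the precision trade-off concrete, here is a small sketch (not NVIDIA code; the function name and the use of simple truncation rather than the hardware's rounding behavior are assumptions) that emulates TF32's reduced mantissa by zeroing the 13 low-order mantissa bits of a float32 value:

```python
import struct

def tf32_truncate(x: float) -> float:
    """Emulate TF32 precision: keep the float32 sign and 8-bit exponent,
    but truncate the 23-bit mantissa down to TF32's 10 bits.
    Simplified sketch; real hardware may round rather than truncate."""
    # Reinterpret the value as its 32-bit pattern (1 sign, 8 exponent, 23 mantissa bits)
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Zero the 13 low-order mantissa bits, keeping the top 10
    bits &= ~((1 << 13) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(tf32_truncate(1.0))           # exactly representable -> 1.0
print(tf32_truncate(1.0 + 2**-10))  # smallest TF32 step above 1.0 survives
print(tf32_truncate(1.0 + 2**-11))  # below TF32 resolution -> truncated to 1.0
```

Values that fit in a 10-bit mantissa pass through unchanged, while finer increments are lost, which is exactly why TF32 matches FP16's precision while covering FP32's numeric range.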

Ampere SM (A100)

Ampere

Comparing the Ampere and Volta SMs, you can see two key differences between the two. Firstly, the number of Tensor Cores per SM has been reduced from eight to four. Secondly, the combined L1 data cache and shared memory has been increased by 50%, from 128KB on Volta to 192KB on Ampere. Otherwise, NVIDIA has left the general SM structure largely unchanged, at least on the surface. The warp schedulers and dispatch units are the same, and so are the registers. Like Volta, Ampere also supports concurrent INT and FP operations, and the two-clock-cycle execution time is likely the same too.

Volta

As already mentioned, the A100 delivers 2.5x the FP64 throughput of the V100. This was achieved by replacing the traditional DFMA instructions with an FP64-based matrix multiply-add instruction, which cuts down on instruction fetches and thereby reduces the scheduling overhead and shared memory bandwidth requirements.

Every Ampere SM is capable of 64 FP64 FMA operations/clock (or 128 FP64 operations/clock), 2x the Volta-based Tesla V100. This results in a peak FP64 throughput of 19.5 TFLOPs, 2.5x that of the V100.
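The 19.5 TFLOPs figure follows directly from the numbers quoted above; a quick back-of-the-envelope check:

```python
# Peak FP64 throughput from the figures above: 108 SMs, 64 FMA/clock each,
# two FLOPs per FMA (multiply + add), at the 1410 MHz boost clock.
sms = 108
fma_per_sm_per_clock = 64
flops_per_fma = 2
boost_clock_hz = 1410e6

peak_fp64_tflops = sms * fma_per_sm_per_clock * flops_per_fma * boost_clock_hz / 1e12
print(f"{peak_fp64_tflops:.1f} TFLOPs")  # prints "19.5 TFLOPs"
```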

The Secret Sauce: Structured Sparsity 

One of the key driving factors behind Ampere’s performance gains is structured sparsity. It essentially doubles the overall throughput for neural networks by compressing the weight matrices, reducing memory storage and bandwidth requirements by 2x.

This is achieved by fine-tuning the weights in the matrices and compressing them to half their size by removing irrelevant (zero) entries, thereby reducing bandwidth and doubling throughput.
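Concretely, NVIDIA's fine-grained constraint is a 2:4 pattern: in every group of four weights, at most two are nonzero, so the matrix can be stored at half size along with small position metadata. A minimal sketch of pruning a weight row to this pattern (illustrative only; the function name is ours and the real metadata packing differs):

```python
def prune_2_to_4(weights):
    """Prune a weight row to a 2:4 structured-sparsity pattern: in every
    group of four weights, keep the two largest-magnitude values and drop
    the rest. Returns the compressed values plus their in-group positions
    (the metadata the hardware would use to expand the matrix on the fly)."""
    assert len(weights) % 4 == 0
    values, positions = [], []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest-magnitude entries, in ascending order
        keep = sorted(sorted(range(4), key=lambda j: -abs(group[j]))[:2])
        values.extend(group[j] for j in keep)
        positions.extend(keep)
    return values, positions

row = [0.9, -0.1, 0.0, 1.4,   0.2, 0.3, -2.0, 0.05]
vals, pos = prune_2_to_4(row)
print(vals)  # [0.9, 1.4, 0.3, -2.0] -- half the storage
print(pos)   # [0, 3, 1, 2]
```

The compressed row is half the length of the original, which is where the 2x storage and bandwidth saving comes from; the sparse Tensor Cores then skip the multiply-by-zero work to double throughput.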

Sparsity works for AI workloads because the output is usually only dependent on certain values (weights). The rest of them that don’t impact the resultant trained network aren’t useful and can be discarded.

According to NVIDIA’s engineers, the structure constraint does not impact the accuracy of the trained network for inferencing, which enables inference acceleration with sparsity. However, for effective results, sparsity needs to be introduced early in the training process.

Memory and Cache

The A100 GPU features 40 GB of HBM2 memory organized into five stacks, each containing eight memory dies. With a 1215 MHz (DDR) data rate, the Ampere accelerator delivers a bandwidth of 1.6 TB/sec, nearly twice as much as the Tesla V100.
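The 1.6 TB/sec figure can be sanity-checked from the stack count and data rate quoted above (assuming the standard 1024-bit interface per HBM2 stack):

```python
# Bandwidth from the figures above: five HBM2 stacks, each with a
# 1024-bit interface, at a 1215 MHz DDR clock (two transfers per cycle).
stacks = 5
bus_bits_per_stack = 1024
clock_hz = 1215e6
transfers_per_clock = 2  # DDR

bandwidth_tb_s = stacks * bus_bits_per_stack * clock_hz * transfers_per_clock / 8 / 1e12
print(f"{bandwidth_tb_s:.2f} TB/s")  # prints "1.56 TB/s", quoted as nearly 1.6
```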

Ampere has a much larger cache than Volta. The A100’s L2 cache is a whopping 40MB, nearly seven times the V100’s 6MB. It is divided into two partitions, each serving the SMs in the GPCs connected to it. This partitioned structure increases the L2 bandwidth by a factor of 2.3x compared to the V100. The larger L2 cache significantly helps AI workloads by keeping bigger portions of datasets and models on-chip.

The L1 cache is similar to the V100’s. Like Volta, the Ampere-based A100 uses a combined shared memory and L1 cache, but its capacity has been increased from 128KB/SM to 192KB/SM. This not only improves performance but also simplifies programming.

Multi-instance GPU Architecture

Another key highlight of the A100 is the Multi-Instance GPU (MIG) architecture. It lets you partition each A100 into as many as seven independent GPU “instances” for optimal utilization and predictable performance for individual end users. Although Volta allowed multiple applications to run on separate compute resources (SMs), it could result in memory bottlenecks, as the cache and DRAM bandwidth were shared across all applications.

In Ampere, all the GPU instances have their own data and memory paths, through separate L2 cache banks, memory controllers and DRAM address buses. This allows the different instances to run smoothly even if the workloads are diverse and some of them have higher memory bandwidth requirements.

Ampere’s MIG essentially lets you run each A100 as up to seven different GPUs, improving utilization and scalability at no additional cost.

PCIe 4.0, NVLink v3 and Asynchronous Copy

The inclusion of a larger cache and massive memory bandwidth requires a faster connection to the rest of the system. This is achieved via the newer PCIe 4.0 standard, used in tandem with AMD’s Epyc Rome chips. NVLink, the GPU-to-GPU interconnect, has also been updated: third-generation NVLink has a data rate of 50 Gbit/sec per signal pair, nearly double the 25.78 Gbit/sec rate of the V100.
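The per-link and aggregate NVLink numbers can be sanity-checked from the signal-pair rate (assuming four signal pairs per link per direction and twelve links per GPU, as on the A100):

```python
# Third-generation NVLink on the A100: 50 Gbit/s per signal pair,
# four pairs per link in each direction, twelve links per GPU.
gbit_per_pair = 50
pairs_per_link = 4
links = 12

per_link_gb_s = gbit_per_pair * pairs_per_link / 8  # one direction, in GB/s
total_gb_s = per_link_gb_s * links * 2              # both directions, all links
print(per_link_gb_s, total_gb_s)  # prints "25.0 600.0"
```

That works out to 25 GB/s per link per direction and 600 GB/s of total NVLink bandwidth per GPU.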

The A100 also introduces a new asynchronous copy instruction that loads data directly from global memory into an SM’s shared memory, bypassing the register file (RF). This reduces register file bandwidth and improves efficiency. Best of all, similar to asynchronous compute, async-copy can be executed in parallel with other SM workloads, without inducing a performance penalty.

Conclusion

Overall, Ampere (A100) increases the performance of HPC and AI workloads by nearly 2x compared to the Volta-based V100. Mixed-precision applications see a massive boost thanks to TF32, cutting execution time by up to a factor of 20. Lastly, it improves the effective throughput per Tensor Core, delivering twice as much performance with half as many cores per SM.

Considering how many new players have entered the HPC market lately, it was only a matter of time before NVIDIA launched a new flagship. Although this isn’t a consumer product, it shows how far NVIDIA has come with Ampere. Even if the next-gen GeForce GPU is half as impressive as the A100, it’ll be more than enough to pacify gamers and put the next-gen consoles to shame.

Areej

Computer Engineering dropout (3 years), writer, journalist, and amateur poet. I started Techquila while in college to pursue my passion for hardware. Although largely successful, it suffered from many internal weaknesses. I left and now work on Hardware Times, a site purely dedicated to processor architectures and in-depth benchmarks. That's what we do here at Hardware Times!
