NVIDIA yesterday launched the first chip based on the 7nm Ampere architecture. While not a graphics card in the traditional sense, it features the same basic design that will later power the consumer Ampere cards. The A100, or as NVIDIA calls it, the "A100 Tensor Core GPU," is an accelerator that speeds up AI and neural-network workloads. As such, most of the focus has gone into improving the mixed-precision capabilities and the Tensor cores. The GA100 is the full implementation of the silicon, while the A100 is a cut-down variant that forms the Ampere-based accelerator.
NVIDIA has disabled a number of SMs on the A100, bringing the core count down from the GA100's 8,192 to 6,912. The memory also takes a hit in the form of a full HBM2 stack: while the GA100 packs six memory stacks, the A100 has five, connected to the GPU via ten 512-bit controllers.
In short, disabling these units costs the A100 the accompanying SMs, cache slices and memory controllers. The full GA100 block diagram looks something like this:
Here are the specs of the new Ampere based A100 next to its predecessors, the Volta and Pascal powered Tesla accelerators:
| Data Center GPU | NVIDIA Tesla P100 | NVIDIA Tesla V100 | NVIDIA A100 |
|---|---|---|---|
| GPU Architecture | NVIDIA Pascal | NVIDIA Volta | NVIDIA Ampere |
| GPU Board Form Factor | SXM | SXM2 | SXM4 |
| FP32 Cores / SM | 64 | 64 | 64 |
| FP32 Cores / GPU | 3584 | 5120 | 6912 |
| FP64 Cores / SM | 32 | 32 | 32 |
| FP64 Cores / GPU | 1792 | 2560 | 3456 |
| INT32 Cores / SM | NA | 64 | 64 |
| INT32 Cores / GPU | NA | 5120 | 6912 |
| Tensor Cores / SM | NA | 8 | 4 |
| Tensor Cores / GPU | NA | 640 | 432 |
| GPU Boost Clock | 1480 MHz | 1530 MHz | 1410 MHz |
Right off the bat, the most noticeable change is the increase in the number of SMs per GPU and, with them, the core counts. While the SM configuration itself hasn't changed much, the Tensor cores per SM have been halved compared to Volta. Despite that, the peak mixed-precision throughput of the A100 is a staggering 20x that of the V100, thanks in part to fine-grained structured sparsity, which essentially doubles the throughput.
As you can see in the above figure, Ampere increases throughput by a factor of 20 for INT8 operations using sparsity, while FP16-based workloads see a 5x jump. High-precision FP64 operations are faster by a factor of 2.5x. Lastly, Ampere supports a new data type called TF32, or Tensor Float 32. It uses the same 8-bit exponent as FP32, giving it the same dynamic range, but provides FP16-level accuracy by using a 10-bit mantissa instead of FP32's 23 bits. TF32 Tensor core math is up to 20x faster than standard FP32 on the V100 while still producing a standard IEEE FP32 output.
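The TF32 idea can be sketched in a few lines: since TF32 keeps FP32's 8-bit exponent and only shortens the mantissa to 10 bits, it can be emulated by masking off the low 13 mantissa bits of a float32 value. This is a simplified sketch; real hardware rounds rather than truncates.

```python
import struct

def round_to_tf32(x: float) -> float:
    """Emulate TF32 by truncating FP32's 23-bit mantissa to 10 bits.

    The sign and 8-bit exponent are left untouched (same dynamic
    range as FP32); only the low 13 mantissa bits are zeroed.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # clear the low 13 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# FP32 pi is 3.14159274...; with only 10 mantissa bits it becomes 3.140625
print(round_to_tf32(3.14159265))  # → 3.140625
```

Values that already fit in 10 mantissa bits, such as 1.0 or -2.5, pass through unchanged.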
Ampere SM (A100)
Comparing the Ampere and Volta SMs, there are two key differences. Firstly, the number of Tensor cores per SM has been reduced from eight to four. Secondly, the combined L1 data cache and shared memory has grown by 50%, from 128KB on Volta to 192KB on Ampere. Otherwise, NVIDIA has left the general SM structure largely unchanged, at least on the surface. The warp schedulers and dispatch units are the same, and so are the register files. Like Volta, Ampere also supports concurrent INT and FP operations, and the two-clock-cycle execution cadence is likely the same too.
As already mentioned, the A100 is 2.5x faster than the V100 at FP64 workloads. This was achieved by supplementing the traditional DFMA instructions with an FP64 matrix multiply-add executed on the Tensor cores. This reduces scheduling overhead and shared-memory bandwidth requirements by cutting down on instruction fetches.
Every Ampere SM is capable of 64 FP64 FMA operations per clock (or 128 FP64 operations per clock), twice the rate of the Volta-based Tesla V100. This results in a peak FP64 throughput of 19.5 TFLOPS, 2.5x more than the V100.
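That peak figure follows directly from the numbers above. The 108-SM count isn't stated in the text but is derived from the spec table: 6,912 FP32 cores at 64 per SM.

```python
# Peak FP64 throughput of the A100, derived from the spec table above.
sms = 108                   # 6912 FP32 cores / 64 per SM
fma_per_sm_per_clock = 64   # FP64 FMA operations per SM per clock
ops_per_fma = 2             # an FMA counts as two floating-point operations
boost_clock_hz = 1410e6     # 1410 MHz boost clock

tflops = sms * fma_per_sm_per_clock * ops_per_fma * boost_clock_hz / 1e12
print(f"{tflops:.1f} TFLOPS")  # → 19.5 TFLOPS
```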
The Secret Sauce: Structured Sparsity
One of the key driving factors behind Ampere’s performance gains is structured sparsity. It essentially doubles the overall throughput for neural networks by compressing the matrix. This reduces the memory storage and bandwidth by 2x.
This is achieved by fine-tuning the weights in the matrices and compressing them to half the size by removing irrelevant (zero) entries, thereby reducing the bandwidth and doubling the throughput.
Sparsity works for AI workloads because the output is usually only dependent on certain values (weights). The rest of them that don’t impact the resultant trained network aren’t useful and can be discarded.
According to NVIDIA’s engineers, the structure constraint does not impact the accuracy of the trained network for inferencing, which enables inference acceleration with sparsity. However, for effective results, sparsity needs to be introduced early in the training process.
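NVIDIA's published scheme is 2:4 fine-grained sparsity: in every group of four weights, two are zeroed, which is what lets the hardware store the matrix at half size. Here's a minimal sketch of such pruning, using the common largest-magnitude heuristic; the actual training-time procedure is more involved.

```python
def prune_2_of_4(weights):
    """Apply 2:4 fine-grained structured sparsity to a flat list of weights.

    In each contiguous group of four weights, keep the two with the
    largest magnitude and zero the rest. The 2-of-4 ratio matches
    NVIDIA's scheme; the magnitude criterion is the usual pruning
    heuristic, assumed here for illustration.
    """
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(len(group)),
                      key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

print(prune_2_of_4([0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.01, 0.4]))
# → [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, 0.0, 0.4]
```

The compressed form then stores only the two surviving values per group plus a small index, which is where the 2x storage and bandwidth savings come from.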
Memory and Cache
The A100 GPU features 40 GB of HBM2 memory organized into five stacks, each containing eight memory dies. With a 1215 MHz (DDR) data rate, the Ampere accelerator delivers a bandwidth of roughly 1.6 TB/sec, about 1.7x the Tesla V100's 900 GB/sec.
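The bandwidth figure checks out from the stack count and data rate. The 1024-bit-per-stack bus width used below is the HBM2 standard (equivalent to two of the 512-bit controllers mentioned earlier), not a number stated in the text.

```python
stacks = 5                       # HBM2 stacks active on the A100
bits_per_stack = 1024            # standard HBM2 bus width per stack (assumed)
transfers_per_sec = 2 * 1215e6   # 1215 MHz clock, double data rate

bandwidth = stacks * bits_per_stack / 8 * transfers_per_sec  # bytes/sec
print(f"{bandwidth / 1e12:.2f} TB/s")  # → 1.56 TB/s
```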
Ampere has a much larger cache than Volta. The A100’s L2 cache is a whopping 40MB, nearly seven times more than the V100’s. It is divided into two partitions, each serving the SMs in the GPCs connected to it. This partitioned structure increases the L2 bandwidth by a factor of 2.3x compared to the V100. The larger L2 cache significantly improves performance in AI workloads by keeping bigger slices of datasets and models on-chip.
The L1 cache design is similar to the V100’s. Like Volta, the Ampere-based A100 combines shared memory and L1 cache in a single block, but the capacity has been increased from 128KB/SM to 192KB/SM. This not only improves performance but also simplifies programming.
Multi-instance GPU Architecture
Another key highlight of the A100 is the multi-instance GPU (MIG) architecture. It lets you partition each A100 into as many as seven independent GPU “instances” for optimal utilization and predictable performance for individual end-users. Although Volta allowed multiple applications to run on separate compute resources (SMs), memory bottlenecks could result because the cache and DRAM bandwidth were shared across all applications.
In Ampere, all the GPU instances have their own data and memory paths, through separate L2 cache banks, memory controllers and DRAM address buses. This allows the different instances to run smoothly even if the workloads are diverse and some of them have higher memory bandwidth requirements.
Ampere’s MIG essentially lets you run each A100 as seven different GPUs, improving utilization and scalability at no additional cost.
PCIe 4.0, NVLink v3 and Asynchronous Copy
The inclusion of a larger cache and massive memory bandwidth calls for a faster connection to the rest of the system. This is provided by the newer PCIe 4.0 standard, used in tandem with AMD’s Epyc Rome chips. NVLink, the GPU-to-GPU interconnect, has also been updated: third-generation NVLink has a data rate of 50 Gbit/sec per signal pair, nearly doubling the V100’s 25.78 Gbit/sec.
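The aggregate NVLink bandwidth follows from that per-pair rate. The four-pairs-per-direction link width and the twelve-link total used below are NVIDIA's published figures for third-gen NVLink on the A100, not numbers stated in the text above.

```python
pairs_per_direction = 4   # differential signal pairs per link, per direction (assumed)
signal_rate_gbit = 50     # Gbit/sec per signal pair (from the article)
links = 12                # NVLink links on the A100 (assumed)

gb_per_link_per_dir = pairs_per_direction * signal_rate_gbit / 8  # 25 GB/s
total = links * gb_per_link_per_dir * 2                           # both directions
print(f"{total:.0f} GB/s total")  # → 600 GB/s total
```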
The A100 also introduces a new asynchronous copy instruction that loads data directly from global memory (L2 cache and DRAM) into the SM’s shared memory. By bypassing the register file (RF) and L1 cache, it reduces register file bandwidth use and improves memory efficiency. Best of all, much like asynchronous compute, async-copy can be executed in the background while the SM carries on with other work, without inducing a performance penalty.
Overall, Ampere (A100) increases the performance of HPC and AI workloads by roughly 2x compared to the Volta-based V100. Mixed-precision applications see a massive boost thanks to TF32, with throughput gains of up to 20x. Lastly, it improves the effective throughput of the Tensor cores, with each SM’s four cores delivering twice the performance of Volta’s eight.
Considering how many new players have entered the HPC market lately, it was only a matter of time before NVIDIA launched a new flagship. Although this isn’t a consumer product, it shows how far NVIDIA has come with Ampere. Even if the next-gen GeForce GPU is half as impressive as the A100, it’ll be more than enough to pacify gamers and put the next-gen consoles to shame.