| Tensor Core | |
|---|---|
| Name | Tensor Core |
| Designer | Nvidia |
| Launched | 2017 |
| Type | ASIC |
| Application | Artificial intelligence, Machine learning, High-performance computing |
A Tensor Core is a specialized processing unit designed by Nvidia to dramatically accelerate matrix operations, which are fundamental to deep learning and scientific computing. First introduced in the Volta GPU architecture, these cores perform mixed-precision calculations, boosting throughput for artificial-intelligence training and inference workloads. Their integration has been pivotal in systems such as the DGX-1 and the Selene supercomputer, pushing the boundaries of high-performance computing.
The core functionality revolves around accelerating the fused multiply-add operation on small matrices, a cornerstone of the algorithms used in neural-network training. This design is integral to Nvidia's strategy for the data center and supercomputer markets, providing the computational muscle for platforms such as the Nvidia A100 and Nvidia H100. By offloading these intensive operations from general-purpose CUDA cores, Tensor Cores enable breakthroughs in fields ranging from computational fluid dynamics to natural language processing, as seen in models like GPT-3.
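The fused multiply-add described above computes D = A×B + C on one small matrix tile per operation (4×4 tiles in the original Volta design). A minimal NumPy sketch of what one such tile operation produces, purely illustrative and not Nvidia's implementation:

```python
import numpy as np

def tile_fma(a, b, c):
    """Fused multiply-add on one small matrix tile: D = A @ B + C.

    A Volta-style Tensor Core performs this whole tile operation as a
    single hardware instruction; here it is just ordinary NumPy math.
    """
    return a @ b + c

# 4x4 tiles, matching the tile size of the first-generation design.
a = np.full((4, 4), 2.0)   # multiplicand: all twos
b = np.eye(4)              # identity, so A @ B == A
c = np.ones((4, 4))        # accumulator input: all ones
d = tile_fma(a, b, c)      # every element: 2.0 * 1 + 1.0 = 3.0
```

In real hardware many of these tile operations run in parallel and are chained to build up large matrix multiplications, with the accumulator `c` carrying the running partial sums between steps.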
Architecturally, these units perform mixed-precision arithmetic, typically handling multiplications in FP16 or bfloat16 while accumulating results in FP32 (or FP64 for HPC workloads) to preserve numerical accuracy. The design is a key feature of successive Nvidia microarchitectures, including Ampere, Hopper, and Ada Lovelace, and operates in tandem with high-bandwidth memory such as HBM2e in products like the Nvidia A100. The Sparse Tensor Core variant, introduced with the Ampere architecture, further improves throughput by exploiting 2:4 structured sparsity in weight matrices, skipping computation on zeroed elements; the technique is leveraged in the Nvidia DGX A100.
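The value of FP32 accumulation can be seen in a small NumPy experiment (an illustrative sketch, not hardware code): summing many FP16 partial products with a pure-FP16 running sum loses low-order bits at every step, while accumulating the same FP16 values in FP32, as Tensor Cores do, avoids that drift.

```python
import numpy as np

# 10,000 FP16 "partial products", standing in for the per-element
# multiplication results inside a large matrix multiply.
rng = np.random.default_rng(0)
products = rng.random(10_000).astype(np.float16)

# Pure FP16 accumulation: once the running sum grows large, its FP16
# ulp exceeds the individual addends and they round away to nothing.
fp16_sum = np.float16(0.0)
for p in products:
    fp16_sum = np.float16(fp16_sum + p)

# Mixed precision: FP16 inputs, FP32 accumulator (the Tensor Core scheme).
fp32_sum = np.float32(0.0)
for p in products:
    fp32_sum += np.float32(p)

err = float(fp32_sum) - float(fp16_sum)  # the FP16 sum falls far short
```

The FP16 running sum stalls once its magnitude makes each small addend round to zero, while the FP32 accumulator keeps every contribution, which is why the hardware widens only the accumulator rather than the (bandwidth-expensive) inputs.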
In terms of performance, these cores deliver large FLOPS improvements for deep-learning workloads, as demonstrated in benchmarks from the MLPerf consortium. This capability is critical for training large-scale artificial-intelligence models at institutions such as OpenAI and Google DeepMind. Applications span numerous domains: autonomous-vehicle research at companies like Waymo, drug-discovery simulations in pharmaceutical research, and climate-modeling projects such as those run by the National Center for Atmospheric Research. The technology also powers real-time inference in services from Microsoft Azure and Amazon Web Services.
The technology was first publicly detailed with the unveiling of the Volta architecture at the GPU Technology Conference in 2017, an innovation driven directly by the escalating computational demands of deep-learning frameworks such as TensorFlow and PyTorch. Subsequent generations evolved significantly: the Ampere architecture doubled throughput and introduced sparsity support, while the Hopper architecture incorporated the Transformer Engine specifically to accelerate models such as those in the GPT series. This trajectory has been central to Nvidia's dominance in the AI-accelerator market, where it competes with offerings from AMD and Intel.
Compared with general-purpose CUDA cores or competing designs such as AMD's Matrix Cores in CDNA or Intel's XMX units in Arc GPUs, the specialization for matrix operations offers superior efficiency on AI workloads. Alternatives such as Google's TPU, Cerebras's WSE, and Graphcore's IPU take different architectural approaches, often optimizing for specific model types or for scalability. In the broader high-performance-computing landscape, Tensor Cores complement traditional CPU-based systems, such as those using AMD EPYC or Intel Xeon processors, within heterogeneous computing environments like the Frontier supercomputer.
Category:Nvidia Category:AI accelerators Category:Computer hardware