| NVLink (interface) | |
|---|---|
| Name | NVLink |
| Introduced | 2014 |
| Developed by | Nvidia |
| Supersedes | PCI Express |
NVLink is a high-bandwidth, energy-efficient interconnect technology developed by Nvidia to facilitate fast data exchange between CPUs and GPUs, as well as between multiple GPUs. It was designed to overcome the bandwidth limitations of traditional interconnects like PCI Express, enabling more efficient scaling for high-performance computing and artificial intelligence workloads. The technology is a cornerstone of Nvidia's accelerated computing platforms, prominently featured in systems like the DGX SuperPOD and the Summit supercomputer.
NVLink creates a direct, high-speed communication pathway that allows processors to share memory through a unified virtual address space. This architecture is critical for applications in scientific computing and deep learning, where massive datasets must be accessed rapidly across multiple accelerators. By enabling a cache-coherent link between the CPU and GPU, it reduces data-movement bottlenecks, significantly improving performance for complex simulations and model training. The interconnect is integral to Nvidia's broader platform strategy, which includes software like CUDA and hardware such as the Grace Hopper Superchip.
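As a concrete illustration of the shared-address-space idea, the following sketch uses CUDA managed memory, where a single pointer is valid on both the CPU and the GPU. On NVLink-coupled CPU-GPU systems such as POWER9 with Tesla V100 or the Grace Hopper Superchip, such accesses can be served coherently over the link, while on PCIe-only systems the CUDA runtime migrates pages instead; the kernel and buffer names here are purely illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: the GPU increments every element of a shared buffer.
__global__ void increment(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *buf = nullptr;

    // One allocation visible to both CPU and GPU through a single pointer.
    // On NVLink-coupled CPU-GPU systems the link backs these accesses in
    // hardware; otherwise the runtime migrates pages over PCIe.
    cudaMallocManaged(&buf, n * sizeof(float));

    for (int i = 0; i < n; ++i) buf[i] = 1.0f;    // CPU writes the buffer
    increment<<<(n + 255) / 256, 256>>>(buf, n);  // GPU updates it in place
    cudaDeviceSynchronize();                      // wait before the CPU reads

    printf("buf[0] = %f\n", buf[0]);              // expect 2.0
    cudaFree(buf);
    return 0;
}
```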
The specifications of NVLink have evolved significantly across generations. The initial version, introduced with the Pascal-based Tesla P100, offered 160 GB/s of aggregate bidirectional bandwidth per GPU across four links. The subsequent NVLink 2, deployed in the Volta architecture's Tesla V100, increased this to 300 GB/s over six links. The latest iteration, NVLink 4, present in the Hopper-based Nvidia H100, provides up to 900 GB/s across eighteen links. The physical layer utilizes multiple lanes per link for data transmission and supports advanced features like ECC for reliability. Its protocol supports both cache-coherent and non-coherent transactions, allowing flexible integration with different CPU designs, such as those from IBM and Arm.
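The per-link structure described above can be inspected at runtime through Nvidia's NVML management library. The sketch below is host-side code, compiled alongside CUDA sources and linked against NVML, that enumerates the active NVLink links of the first GPU and reports the NVLink version each link negotiated. It is a minimal illustration that assumes the NVML headers and library are installed; it is not part of the NVLink specification itself.

```cuda
#include <cstdio>
#include <nvml.h>

int main() {
    nvmlInit();

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);  // first GPU in the node

    // Probe every possible link; links that are absent or disabled are skipped.
    for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
        nvmlEnableState_t active;
        if (nvmlDeviceGetNvLinkState(dev, link, &active) != NVML_SUCCESS ||
            active != NVML_FEATURE_ENABLED)
            continue;

        unsigned int version = 0;
        nvmlDeviceGetNvLinkVersion(dev, link, &version);
        printf("link %u: active, NVLink version %u\n", link, version);
    }

    nvmlShutdown();
    return 0;
}
```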
NVLink was first publicly announced by Nvidia in 2014, with the goal of redefining data center architecture by tightly coupling CPUs and GPUs. Its development was driven by the performance walls encountered in systems using PCI Express for communication in high-performance computing environments. The first commercial implementation arrived in 2016 with the Pascal generation, notably powering the DGX-1 AI supercomputer. A major milestone was its adoption in the Summit and Sierra supercomputers at Oak Ridge National Laboratory and Lawrence Livermore National Laboratory, which utilized NVLink 2 to connect IBM POWER9 processors with Nvidia Tesla V100 GPUs.
NVLink has been implemented across several key Nvidia product lines and partner systems. In the data center, it is a defining feature of the Nvidia DGX series, including the DGX A100 and DGX H100, and is essential for the Nvidia HGX baseboard platform used by major OEMs like Hewlett Packard Enterprise and Dell Technologies. The Grace Hopper Superchip leverages NVLink-C2C, a chip-to-chip variant, to connect the Grace CPU and Hopper GPU. Consumer-grade implementations were offered as NVLink bridges for SLI on Turing-based GeForce RTX 20 Series cards, such as the GeForce RTX 2080 Ti, and later on the Ampere-based GeForce RTX 3090.
Compared to the ubiquitous PCI Express standard, NVLink provides substantially higher bandwidth and lower latency, which is crucial for GPU-centric workloads. While a PCI Express 5.0 x16 connection offers up to 128 GB/s of bidirectional bandwidth, NVLink 4 provides up to 900 GB/s. Unlike InfiniBand, which is a network fabric for connecting separate nodes, NVLink is designed for intra-node connectivity within a single server or supercomputer. It also differs from AMD's Infinity Fabric, which is a more generalized interconnect architecture for linking components within AMD's own CPUs and GPUs, such as in the AMD Instinct MI250X. The Compute Express Link (CXL) consortium, which includes Intel, AMD, and Arm, is developing a competing cache-coherent standard for CPU-to-device connectivity.
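To show how software exercises these intra-node links in practice, the sketch below uses the CUDA peer-to-peer API: when two GPUs in the same node are joined by NVLink, an enabled peer mapping lets cudaMemcpyPeer move data directly over the link, and the same call takes the PCIe path otherwise. The device indices 0 and 1 and the buffer size are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 access GPU 1?
    if (!canAccess) {
        printf("no peer access between GPU 0 and GPU 1\n");
        return 0;
    }

    const size_t bytes = 256 << 20;              // 256 MiB test buffer
    float *src = nullptr, *dst = nullptr;

    cudaSetDevice(1); cudaMalloc(&src, bytes);   // source buffer on GPU 1
    cudaSetDevice(0); cudaMalloc(&dst, bytes);   // destination buffer on GPU 0
    cudaDeviceEnablePeerAccess(1, 0);            // map GPU 1's memory into GPU 0

    // Direct device-to-device copy: routed over NVLink when the GPUs are
    // linked, otherwise over the PCIe fabric.
    cudaMemcpyPeer(dst, 0, src, 1, bytes);
    cudaDeviceSynchronize();

    // Relative ranking of the link between the two devices, as reported
    // by the runtime.
    int rank = 0;
    cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, 0, 1);
    printf("peer copy done, performance rank %d\n", rank);

    cudaFree(dst);
    cudaSetDevice(1); cudaFree(src);
    return 0;
}
```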