LLMpedia
The first transparent, open encyclopedia generated by LLMs

GPGPU

Generated by DeepSeek V3.2
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: John Owens (Hop 4)
Expansion Funnel: Raw 72 → Dedup 0 → NER 0 → Enqueued 0
1. Extracted: 72
2. After dedup: 0 (None)
3. After NER: 0
4. Enqueued: 0
GPGPU
Name: General-Purpose Computing on Graphics Processing Units
Caption: An NVIDIA Tesla P100, a GPU designed for high-performance computing.
Inventor: Various researchers and companies
Introduced: Early 2000s

GPGPU, or general-purpose computing on graphics processing units, represents a paradigm shift in computing: it leverages the massively parallel architecture of modern graphics processors for non-graphics tasks. This approach exploits the high computational throughput and memory bandwidth of devices originally designed for rendering computer graphics to accelerate a wide range of scientific and data-intensive applications. The field emerged from the realization that the programmable shader units in GPUs could be repurposed for general data-parallel computations, yielding significant performance gains over traditional central processing units for suitable workloads.

Overview

The fundamental concept involves using a graphics processing unit, typically designed to accelerate the rasterization pipeline behind graphics APIs such as DirectX and OpenGL, to perform computations unrelated to image generation. This is possible because many complex problems in fields like computational fluid dynamics and molecular dynamics can be structured as parallel algorithms. Early work was pioneered by researchers at institutions such as Stanford University and at NVIDIA Research, who demonstrated that the stream processing model of GPUs was applicable to general-purpose problems. The success of this approach led to dedicated hardware and software ecosystems from major vendors including NVIDIA, AMD, and Intel.
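The stream processing model mentioned above can be sketched in a few lines: a "kernel" function is applied independently to every element of an input stream, which is precisely what makes such workloads parallelizable. This is a hedged CPU-side illustration; `run_kernel` and `saxpy_element` are hypothetical names for this sketch, not part of any GPU API.

```python
# Illustrative sketch of the stream-processing model behind early GPGPU work:
# one kernel function applied independently to every element of a data stream.

def run_kernel(kernel, stream):
    """Apply `kernel` to each element; element order does not matter,
    so a parallel device could process all elements at once."""
    return [kernel(x) for x in stream]

def saxpy_element(x, a=2.0, y=1.0):
    """A toy per-element kernel: scale and offset (a * x + y)."""
    return a * x + y

data = [0.0, 1.0, 2.0, 3.0]
result = run_kernel(saxpy_element, data)
print(result)  # [1.0, 3.0, 5.0, 7.0]
```

Because each element is computed without reference to its neighbors, the same kernel can be mapped across thousands of GPU threads with no coordination.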

Architecture and Hardware

Modern GPGPU hardware is built around a many-core architecture consisting of numerous streaming multiprocessors, each containing multiple CUDA cores or analogous execution units. This design provides immense floating-point performance, as seen in products like the NVIDIA A100 and the AMD Instinct MI250X. Key architectural features include high-bandwidth memory technologies such as HBM2 and sophisticated memory hierarchies with dedicated L1 and L2 caches. Interconnect technologies like NVLink and InfiniBand are critical for multi-device configurations in systems such as Oak Ridge National Laboratory's Frontier supercomputer. The architecture differs fundamentally from that of a CPU, trading single-thread performance for extreme parallelism.
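The pairing of high arithmetic throughput with high-bandwidth memory can be made concrete with a back-of-the-envelope "machine balance" calculation: how many floating-point operations a kernel must perform per byte of memory traffic before it stops being bandwidth-bound. The figures below are approximate, commonly quoted A100 numbers used purely for illustration, not exact specifications.

```python
# Rough sketch: the compute/bandwidth balance point of a GPU.
# A kernel doing fewer FLOPs per byte than this is memory-bandwidth-bound.

def balance_point(peak_flops, peak_bytes_per_s):
    """FLOPs a kernel must perform per byte moved to saturate the ALUs."""
    return peak_flops / peak_bytes_per_s

# Approximate publicly quoted NVIDIA A100 figures (illustrative only):
flops = 19.5e12       # ~19.5 TFLOP/s FP32
bandwidth = 1.555e12  # ~1.555 TB/s HBM2 bandwidth

print(f"{balance_point(flops, bandwidth):.1f} FLOPs per byte")  # 12.5 FLOPs per byte
```

A simple element-wise addition performs well under one FLOP per byte moved, which is why such kernels are limited by memory bandwidth rather than compute on any modern GPU.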

Programming Models and APIs

Specialized programming models and application programming interfaces abstract the underlying hardware complexity. NVIDIA's CUDA platform, introduced with the GeForce 8 series, became a dominant model, providing extensions to languages like C++. The open standard OpenCL, maintained by the Khronos Group, offers a vendor-agnostic framework supported by AMD, Intel, and others. More recent high-level approaches include SYCL and directive-based models like OpenMP and OpenACC, which allow developers to annotate code for parallel execution. These models manage the execution of thousands of concurrent threads across the device's processing elements.
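The thread hierarchy these models expose can be illustrated with a sequential CPU emulation of a one-dimensional kernel launch: each logical thread derives a global index from its block and thread coordinates, mirroring the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`. This is a hedged sketch; `launch` and `vector_add` are hypothetical helpers for illustration, not real device code.

```python
# Sequential emulation of a 1-D GPU kernel launch: a grid of blocks,
# each containing block_dim threads, with each thread handling one element.

def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a kernel launch by visiting every (block, thread) pair."""
    for block in range(grid_dim):
        for thread in range(block_dim):
            kernel(block, thread, block_dim, *args)

def vector_add(block, thread, block_dim, a, b, out):
    i = block * block_dim + thread  # global thread index
    if i < len(out):                # bounds guard: grid may overshoot the data
        out[i] = a[i] + b[i]

n = 6
a, b, out = [1] * n, [2] * n, [0] * n
launch(vector_add, 2, 4, a, b, out)  # 2 blocks x 4 threads = 8 logical threads
print(out)  # [3, 3, 3, 3, 3, 3]
```

The bounds guard is the standard pattern when the thread count is rounded up past the data size; on a real device the two loops would run concurrently across the hardware's execution units.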

Applications

GPGPU applications span numerous scientific and commercial domains. In artificial intelligence, GPU computing is foundational for training deep neural networks with frameworks like TensorFlow and PyTorch. The Folding@home project uses it to simulate protein folding. In computational finance, it accelerates Monte Carlo methods for risk analysis. Geophysical exploration companies such as Schlumberger use it for seismic imaging, while in medical imaging it speeds up reconstructions for magnetic resonance imaging and computed tomography. The SETI@home project employed it to analyze radio telescope data from the Arecibo Observatory.
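Monte Carlo methods illustrate why many of these workloads map so well onto GPUs: every sample is drawn and evaluated independently, so millions can run in parallel. Below is a toy CPU-side sketch estimating pi by sampling a quarter circle; on a device, each thread would draw and test its own batch of samples before a final reduction.

```python
import random

# Toy Monte Carlo estimate of pi. Each sample is independent -- the property
# that lets GPU implementations assign samples to thousands of threads.

def estimate_pi(n_samples, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            hits += 1
    return 4.0 * hits / n_samples

print(estimate_pi(100_000))  # roughly 3.14
```

The same embarrassingly parallel structure underlies Monte Carlo pricing and risk simulations in finance, where each simulated market path is an independent sample.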

Performance Considerations

Achieving optimal performance requires careful attention to memory coalescing, thread divergence, and latency hiding through massive multithreading. Efficient use of the memory hierarchy, including shared memory and cache coherency protocols, is critical. Amdahl's law highlights the difficulty of speeding up the sequential portions of a program, while Gustafson's law offers an alternative, scaled-problem perspective. Performance analysis tools like NVIDIA Nsight and the AMD ROCm profiler help developers optimize kernel execution. Data transfer overhead between the host processor and the device over PCI Express can be a significant bottleneck.
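The two scaling laws mentioned above are easy to state as formulas. Amdahl's law fixes the problem size, so the serial fraction caps speedup; Gustafson's law scales the problem with the processor count, giving a far more optimistic picture. A minimal sketch:

```python
# Amdahl's law: fixed problem size, speedup limited by the serial fraction.
# Gustafson's law: problem size grows with processors (scaled speedup).

def amdahl_speedup(parallel_fraction, n_processors):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

def gustafson_speedup(parallel_fraction, n_processors):
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * n_processors

# A program that is 95% parallel, run on a 1000-core device:
print(amdahl_speedup(0.95, 1000))     # ~19.6, capped below 1/0.05 = 20
print(gustafson_speedup(0.95, 1000))  # ~950, the scaled-problem view
```

The gap between the two results explains why GPGPU workloads are usually scaled up (larger grids, more samples) rather than merely sped up at a fixed size.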

History and Evolution

Early explorations in the late 1990s and early 2000s used graphics APIs like OpenGL and Direct3D to perform general computations, a technique often called "GPGPU via graphics APIs." A landmark was the 2004 publication "Brook for GPUs" by researchers at Stanford University. The 2006 release of NVIDIA's CUDA platform marked a turning point, providing the first widely adopted programming platform dedicated to general-purpose GPU computing. It was followed by OpenCL 1.0, ratified by the Khronos Group in late 2008. The technology's impact was recognized when the 2020 ACM Gordon Bell Prize was awarded for work run on the Summit supercomputer. The evolution continues with the integration of specialized cores for ray tracing and tensor operations in modern architectures.