LLMpedia
The first transparent, open encyclopedia generated by LLMs

FP8

Generated by DeepSeek V3.2
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: Nvidia H100 (hop 4)
Expansion funnel: 76 extracted → 0 after dedup → 0 after NER → 0 enqueued
FP8 (values below refer to the E5M2 variant)
Name: FP8
Base: IEEE 754 conventions
Exponent bias: 15
Range: ~±57344
Precision: ~0.25 (spacing at 1.0)

FP8 is an 8-bit floating-point format designed to accelerate machine learning and high-performance computing workloads. It represents a significant evolution from traditional formats like FP32 and FP16, offering a balance between computational efficiency and model accuracy. The format has been standardized through collaborative efforts by industry leaders including NVIDIA, Arm, and Intel, and is seeing rapid adoption in next-generation artificial intelligence hardware.

Overview

The development of FP8 is driven by the exponential growth in computational demands from large language models and computer vision applications. Earlier research on reduced-precision training and inference demonstrated that lower precision arithmetic could maintain accuracy while drastically improving performance. This led to a joint FP8 format proposal published by NVIDIA, Arm, and Intel in 2022, a parallel proposal from Graphcore, AMD, and Qualcomm, and ongoing standardization work in the IEEE P3109 working group on arithmetic formats for machine learning; FP8 is not part of the IEEE 754 standard itself. The format's primary goal is to reduce memory bandwidth pressure and increase throughput in tensor core operations, which are fundamental to deep learning frameworks like TensorFlow and PyTorch.

Technical details

FP8 exists in two primary variants: E5M2, with 5 exponent bits and 2 mantissa bits, and E4M3, with 4 exponent bits and 3 mantissa bits. The E5M2 variant, with an exponent bias of 15, offers a wider dynamic range suited to gradients, while E4M3, with an exponent bias of 7, provides higher precision for weights and activations. Both variants include a sign bit and support subnormal numbers. Their treatment of special values differs: E5M2 follows IEEE 754 conventions, encoding NaN (Not a Number), positive infinity, and negative infinity, whereas E4M3 gives up infinities and reserves only a single NaN encoding (all exponent and mantissa bits set) in order to extend its largest finite value to ±448.

Applications

FP8 is predominantly used for inference and training of neural networks, particularly within transformer model architectures popularized by GPT-4 and BERT. It enables faster execution of matrix multiplication kernels in domains such as autonomous vehicle perception systems and natural language processing pipelines at companies like Meta. The format is also being evaluated for scientific computing tasks at laboratories including Lawrence Livermore National Laboratory and CERN, where it can accelerate computational fluid dynamics simulations. Its efficiency makes it critical for deploying models on edge devices powered by Jetson modules and Snapdragon processors.
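Deploying a model in FP8 typically means scaling each tensor and rounding to the nearest representable value. The following is a hedged pure-Python sketch of this "fake quantization" step for E4M3; `e4m3_values`, `fake_quantize`, and the fixed per-tensor scale are illustrative assumptions, not any framework's API:

```python
import bisect

def e4m3_values():
    """All finite E4M3 values (bias 7, no infinities, one reserved NaN pattern)."""
    vals = set()
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # reserved NaN encoding
            if exp == 0:   # subnormal
                v = (man / 8) * 2.0 ** -6
            else:          # normal, implicit leading 1
                v = (1 + man / 8) * 2.0 ** (exp - 7)
            vals.add(v)
            vals.add(-v)
    return sorted(vals)

GRID = e4m3_values()

def fake_quantize(x, scale):
    """Scale, saturate to +/-448, and round to the nearest E4M3 grid point."""
    y = max(min(x / scale, 448.0), -448.0)
    i = bisect.bisect_left(GRID, y)
    if i == 0:
        q = GRID[0]
    elif i == len(GRID):
        q = GRID[-1]
    else:
        q = min(GRID[i - 1], GRID[i], key=lambda g: abs(g - y))
    return q * scale
```

With a scale of 1.0, a value of 0.3 snaps to the nearest grid point 0.3125 and anything beyond 448 saturates; real FP8 pipelines pick the scale per tensor (for example from its maximum absolute value) and typically use round-to-nearest-even.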

Hardware support

Major accelerator architectures have rapidly integrated FP8 support. NVIDIA's Hopper and Ada Lovelace GPUs feature dedicated FP8 paths in the tensor cores of their streaming multiprocessors. AMD's CDNA 3-based Instinct MI300 accelerators include native support, as do Intel's Gaudi 2 deep learning processors. In the mobile and embedded space, Arm has defined FP8 instruction-set extensions for the Armv9-A architecture. These implementations are typically driven through software stacks such as NVIDIA TensorRT and Transformer Engine or AMD ROCm, which manage the scaling factors needed to keep tensors within FP8's narrow dynamic range.

Comparison with other formats

Compared to FP16, FP8 halves the memory footprint and bandwidth, which is crucial for data centers operated by Amazon Web Services and Microsoft Azure. Against integer formats like INT8, used in quantization toolchains such as TensorFlow Lite, FP8 maintains a higher dynamic range without equally elaborate calibration procedures. It offers less precision than BF16, the 16-bit format championed by Google for TPU training, which retains FP32's 8-bit exponent. The choice between FP8 and even narrower formats such as FP4 is an actively studied trade-off, balancing model fidelity against computational cost.
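These range trade-offs can be made concrete by computing each format's largest finite value directly from its bit layout. A short sketch (`max_finite` is an illustrative helper; the non-IEEE branch encodes the E4M3 convention of reserving only one NaN bit pattern):

```python
def max_finite(exp_bits, man_bits, ieee_specials):
    """Largest finite value of a binary float format with the given layout."""
    bias = (1 << (exp_bits - 1)) - 1
    if ieee_specials:
        # IEEE-style: the all-ones exponent is reserved for inf/NaN.
        return (2.0 - 2.0 ** -man_bits) * 2.0 ** ((1 << exp_bits) - 2 - bias)
    # E4M3-style: only all-ones exponent + all-ones mantissa is reserved (NaN).
    return (2.0 - 2.0 ** (1 - man_bits)) * 2.0 ** ((1 << exp_bits) - 1 - bias)

for name, (e, m, ieee) in {
    "FP16":     (5, 10, True),
    "BF16":     (8, 7,  True),
    "FP8 E5M2": (5, 2,  True),
    "FP8 E4M3": (4, 3,  False),
}.items():
    # eps: spacing between 1.0 and the next representable value
    print(f"{name:9s} eps={2.0 ** -m:<10g} max={max_finite(e, m, ieee):g}")
```

This reproduces the familiar figures: FP16 tops out at 65504, BF16 at roughly 3.4e38, E5M2 at 57344, and E4M3 at 448, while the epsilon column shows why BF16 (2^-7) is more precise than either FP8 variant (2^-2 and 2^-3).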

Category:Computer arithmetic Category:Data types Category:Machine learning