| cuDNN | |
|---|---|
| Name | cuDNN |
| Developer | NVIDIA |
| Released | 2014 |
| Programming language | C, C++ |
| Operating system | Linux, Windows |
| Platform | x86_64, ARM64 |
| Genre | Deep learning library |
| License | Proprietary |
cuDNN
cuDNN is a GPU-accelerated library of deep learning primitives developed by NVIDIA for its GPUs. It provides highly tuned implementations of convolution, pooling, normalization, and activation routines used in modern neural networks, and is widely employed in conjunction with deep learning frameworks and toolchains across industry and academia. The library has influenced performance engineering practice in high-performance computing and artificial intelligence research.
cuDNN was developed by NVIDIA to accelerate deep neural network workloads on CUDA-enabled GPUs and has become a foundational component in many production stacks and research projects. It is used as a backend by frameworks such as TensorFlow, PyTorch, MXNet, Caffe, Theano, Chainer, CNTK, Keras, PaddlePaddle, ONNX Runtime, and JAX, and by deployment and training tools including TensorRT, Horovod, and DeepSpeed. The library is optimized for NVIDIA GPU architectures including Kepler, Maxwell, Pascal, Volta, Turing, Ampere, and Ada Lovelace, and runs on hardware ranging from consumer GTX and RTX cards to data-center GPUs such as the V100 and A100 and integrated systems like the DGX-1 and DGX Station.
cuDNN exposes a C API, along with a C++ frontend, that integrates its low-level kernels into higher-level frameworks and runtime systems. Its design encapsulates algorithms for convolutions, recurrent networks, and tensor transformations behind opaque handles and descriptors, and it interoperates with companion libraries such as cuBLAS, cuFFT, cuSPARSE, and NCCL and with CUDA runtime features including streams, graphs, unified memory, NVLink, and GPUDirect. Like tuned BLAS-style numerical libraries such as ATLAS, FFTW, OpenBLAS, and Eigen, it relies on algorithm selection heuristics and autotuning. cuDNN supports mixed-precision arithmetic and Tensor Core acceleration, and builds against standard toolchains such as GCC, Clang, and NVCC.
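The handle-and-descriptor style of API described above can be illustrated with a minimal pure-Python sketch. All names here (`TensorDescriptor`, `FilterDescriptor`, `ConvDescriptor`, `conv2d_forward`) are hypothetical stand-ins, not real cuDNN symbols, and the convolution is a naive direct implementation rather than an optimized kernel:

```python
# Hedged sketch of a cuDNN-style descriptor workflow in pure Python.
# The names below are illustrative only; the real library exposes C
# functions that operate on opaque handles and descriptor objects.
from dataclasses import dataclass

@dataclass
class TensorDescriptor:   # NCHW layout metadata
    n: int
    c: int
    h: int
    w: int

@dataclass
class FilterDescriptor:   # K output channels, C input channels, R x S kernel
    k: int
    c: int
    r: int
    s: int

@dataclass
class ConvDescriptor:     # symmetric zero padding and stride, no dilation
    pad: int
    stride: int

def conv2d_forward(x_desc, x, f_desc, f, conv):
    """Naive direct convolution; x is [n][c][h][w], f is [k][c][r][s]."""
    oh = (x_desc.h + 2 * conv.pad - f_desc.r) // conv.stride + 1
    ow = (x_desc.w + 2 * conv.pad - f_desc.s) // conv.stride + 1
    y = [[[[0.0] * ow for _ in range(oh)]
          for _ in range(f_desc.k)] for _ in range(x_desc.n)]
    for ni in range(x_desc.n):
        for ki in range(f_desc.k):
            for oy in range(oh):
                for ox in range(ow):
                    acc = 0.0
                    for ci in range(x_desc.c):
                        for ri in range(f_desc.r):
                            for si in range(f_desc.s):
                                iy = oy * conv.stride - conv.pad + ri
                                ix = ox * conv.stride - conv.pad + si
                                # zero padding: skip out-of-bounds taps
                                if 0 <= iy < x_desc.h and 0 <= ix < x_desc.w:
                                    acc += x[ni][ci][iy][ix] * f[ki][ci][ri][si]
                    y[ni][ki][oy][ox] = acc
    return y
```

A convenient sanity check: a 1x1 identity filter with no padding and stride 1 returns the input unchanged.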
cuDNN implements core neural network building blocks. Convolution variants include standard, transposed, grouped, and depthwise convolutions, along with Winograd-based implementations, covering models from AlexNet, VGG, ResNet, and Inception to MobileNet, EfficientNet, DenseNet, ShuffleNet, SqueezeNet, U-Net, and Transformer-based models such as BERT and GPT; recurrent primitives cover LSTM and GRU cells. Pooling operations include max, average, and global pooling, and normalization routines include batch normalization and local response normalization. These primitives appear throughout architectures developed in academic research and deployed on commercial cloud platforms.
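As a small illustration of the pooling primitives mentioned above, the following is a hedged pure-Python sketch of 2D max pooling over a single channel (`max_pool2d` is an illustrative name, not a cuDNN symbol, and real implementations operate on batched multi-channel tensors on the GPU):

```python
def max_pool2d(x, kernel, stride):
    """Max pooling over a single-channel 2D map x (a list of rows).

    Each output element is the maximum over a kernel x kernel window,
    with windows advanced by `stride` and no padding.
    """
    h, w = len(x), len(x[0])
    oh = (h - kernel) // stride + 1
    ow = (w - kernel) // stride + 1
    return [[max(x[oy * stride + ky][ox * stride + kx]
                 for ky in range(kernel) for kx in range(kernel))
             for ox in range(ow)]
            for oy in range(oh)]
```

Average and global pooling follow the same window structure, replacing `max` with a mean over the window or over the whole map.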
cuDNN provides multiple implementations of each operation and uses workspace management to trade memory for speed: callers can either query heuristics for a recommended algorithm or benchmark candidates and select the fastest one that fits a given workspace budget, an approach similar in spirit to autotuning in AutoML research and compiler frameworks such as LLVM and XLA (Accelerated Linear Algebra). Optimizations exploit specialized hardware such as Tensor Cores and interconnects such as NVLink. Performance is commonly profiled with NVIDIA Nsight Systems, NVIDIA Nsight Compute, perf (Linux), and framework-level tools such as TensorBoard and MLflow, and cuDNN-backed workloads are deployed at scale by companies including Facebook, Google, Microsoft, Amazon, Baidu, and Tencent and discussed in research at venues such as ICML, NeurIPS, and CVPR.
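The benchmark-then-select strategy described above can be sketched as follows. This is a loose analogy to cuDNN's autotuning mode, not its real API; the `(name, workspace_bytes)` scheme and `find_fastest_algo` are illustrative inventions:

```python
import time

def find_fastest_algo(candidates, run, workspace_limit):
    """Pick the fastest algorithm that fits the workspace budget.

    candidates: list of (name, workspace_bytes) pairs.
    run(name): executes one timed invocation of that algorithm.
    Returns the winning name, or None if nothing fits the limit.
    """
    best_name, best_time = None, float("inf")
    for name, workspace in candidates:
        if workspace > workspace_limit:
            # Skip algorithms whose scratch memory exceeds the budget,
            # mirroring the memory-for-speed trade-off in the text.
            continue
        start = time.perf_counter()
        run(name)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name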
Bindings and wrapper projects allow cuDNN to be used from languages and runtimes including Python, C, C++, Java, Go, Rust, Julia, R, MATLAB, and Swift, through projects such as PyCUDA, CuPy, Theano, Torch7, the MXNet bindings, and integration layers in ONNX and ONNX Runtime. Higher-level frameworks call cuDNN through adapter libraries maintained by NVIDIA and by framework communities, with most such projects hosted on GitHub and GitLab.
cuDNN's release cadence has followed NVIDIA's GPU architecture roadmap and CUDA Toolkit releases, with milestones aligning to compute capability changes introduced in Kepler, Maxwell, Pascal, Volta, Turing, Ampere, and Ada Lovelace. Major additions over time include support for Tensor Cores, fused kernels, mixed-precision training, improved LSTM/GRU primitives, expanded convolution algorithms, and compatibility improvements with frameworks such as TensorFlow, PyTorch, MXNet, and Caffe2 and deployment tools such as TensorRT and the NVIDIA Triton Inference Server. The library's evolution is discussed in talks at NVIDIA GTC, in vendor documentation, and in research presented at venues such as NeurIPS and ICML.
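The mixed-precision pattern mentioned above, storing operands in half precision while accumulating in a wider format as Tensor Cores do, can be sketched in pure Python using the `struct` module's IEEE 754 half-precision format code `"e"`. The names `to_fp16` and `dot_mixed` are illustrative, not cuDNN API symbols:

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def dot_mixed(a, b):
    """Dot product with fp16-rounded inputs and wide accumulation.

    Mirrors the fp16-multiply / fp32-accumulate pattern: inputs lose
    precision to half, but the running sum is kept in full (double)
    precision, so rounding error does not compound across the sum.
    """
    acc = 0.0  # accumulator stays in Python's double precision
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return acc
```

Small integers round-trip through half precision losslessly, while a value such as 0.1 does not, which makes the storage/accumulation split easy to observe.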
Category:Software