LLMpedia: The first transparent, open encyclopedia generated by LLMs

SLURM (job scheduler)

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Parent: XSEDE (Hop 4)
Expansion Funnel: Raw 71 → Dedup 0 → NER 0 → Enqueued 0
SLURM (job scheduler)
Name: SLURM
Developer: SchedMD
Released: 2003
Operating system: Linux
License: GNU General Public License

SLURM (the Slurm Workload Manager, originally the Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduler designed for large-scale cluster and supercomputing environments. It is widely used on systems operated by organizations such as the National Energy Research Scientific Computing Center, Oak Ridge National Laboratory, Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Argonne National Laboratory. SLURM coordinates compute resources, queues, and job execution across hosts built by vendors such as Dell Technologies, Hewlett Packard Enterprise, Cray Inc., IBM, and Fujitsu.
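As a minimal sketch of this coordination, assuming a host where the standard SLURM client tools (sbatch, squeue) are installed and a default partition exists, a trivial job can be submitted and observed from Python:

import subprocess

# Submit a trivial job; --wrap turns the given command string into a batch script.
submit = subprocess.run(
    ["sbatch", "--wrap", "hostname"],
    capture_output=True, text=True, check=True,
)
# On success sbatch prints "Submitted batch job <id>".
job_id = submit.stdout.strip().split()[-1]
print("submitted job", job_id)

# Inspect the job's queue state; -j filters squeue output by job id.
print(subprocess.run(["squeue", "-j", job_id], capture_output=True, text=True).stdout)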

Overview

SLURM originated in development efforts led by programmers at Lawrence Livermore National Laboratory and was further advanced by SchedMD and through collaborations with projects at National Science Foundation centers and DOE facilities. It competes and interoperates conceptually with systems such as PBS Professional, Sun Grid Engine, Torque (software), and LSF (software), and has been deployed on installations including Frontera (supercomputer), Perlmutter (supercomputer), and research clusters at universities such as the Massachusetts Institute of Technology, Stanford University, University of California, Berkeley, and University of Texas at Austin. The project's governance and adoption intersect with initiatives from XSEDE, PRACE, and national laboratories.

Architecture and Components

SLURM's architecture separates the control plane from the compute plane through a small set of daemons and utilities. Core components include the central controller daemon slurmctld, the per-node compute daemon slurmd, and the accounting database daemon slurmdbd; these roles loosely parallel those found in Beowulf (cluster architecture) environments and the orchestration concepts used by Kubernetes. The system relies on configuration files, resource plugins, and message passing similar to middleware used at Los Alamos National Laboratory and Argonne National Laboratory. Hardware integration spans interconnects such as InfiniBand and Omni-Path, and networking equipment from Mellanox Technologies and Intel Corporation.
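A small sketch of how the two planes can be probed, assuming the SLURM client utilities are on the PATH: scontrol ping asks whether slurmctld responds, while sinfo summarizes the node states reported through slurmd.

import subprocess

def run(cmd):
    """Run a SLURM client command and return its stdout (empty string on error)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else ""

# Control plane: ask whether the controller daemon (slurmctld) is responding.
print(run(["scontrol", "ping"]))

# Compute plane: one line per partition/state group, with node counts.
# %P = partition, %T = node state, %D = number of nodes in that state.
print(run(["sinfo", "-o", "%P %T %D"]))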

Job Submission and Scheduling

Users submit batch, array, and interactive jobs via command-line tools such as sbatch, srun, and salloc, which sit alongside GNU Project toolchains and shells in typical research environments at institutions like the California Institute of Technology and Cornell University. SLURM scheduling policies support priority, fair-share, backfill, and preemption strategies comparable to policy implementations at Sandia National Laboratories and Pacific Northwest National Laboratory. Advanced features include job arrays used by researchers at Princeton University and GPU-aware scheduling suitable for workloads run on systems such as NVIDIA DGX appliances and clusters at the Argonne Leadership Computing Facility.
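The sketch below illustrates these submission features under stated assumptions: it composes a batch script with common #SBATCH directives, including a ten-task job array and a per-task GPU request, and pipes it to sbatch on stdin. The partition name and resource sizes are hypothetical placeholders that differ per site.

import subprocess

# --array requests a ten-task job array; --gres requests one GPU per task.
# The partition name "batch" and the sizes below are hypothetical, site-specific values.
script = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --partition=batch
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --array=0-9
#SBATCH --gres=gpu:1

# Each array task receives its own index via SLURM_ARRAY_TASK_ID.
srun echo "array task ${SLURM_ARRAY_TASK_ID} on $(hostname)"
"""

# Feed the script to sbatch on stdin rather than writing a temporary file.
result = subprocess.run(["sbatch"], input=script, capture_output=True, text=True)
print(result.stdout or result.stderr)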

Resource Management and Accounting

Resource allocation in SLURM tracks compute cores, memory, GPUs, and accelerators, with bookkeeping integrated into databases and accounting systems used by centers such as the National Center for Supercomputing Applications and the European Centre for Medium-Range Weather Forecasts. The slurmdbd component records usage for billing, reporting, and chargeback mechanisms implemented at facilities like the Oak Ridge Leadership Computing Facility and the Science and Technology Facilities Council. Integration points exist with identity and access systems such as Microsoft Active Directory and with authentication frameworks similar to deployments at Lawrence Berkeley National Laboratory.
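As an illustrative sketch, assuming a site where slurmdbd accounting is enabled, usage records can be pulled with sacct in a machine-readable form; the start date below is a hypothetical placeholder.

import subprocess

# Query the accounting database (requires slurmdbd to be configured).
# --format selects standard sacct fields; -S bounds the start of the window.
result = subprocess.run(
    [
        "sacct",
        "-S", "2024-01-01",
        "--format=JobID,JobName,Partition,AllocCPUS,Elapsed,State",
        "--parsable2", "--noheader",   # pipe-separated output, no header row
    ],
    capture_output=True, text=True,
)

for line in result.stdout.splitlines():
    job_id, name, partition, cpus, elapsed, state = line.split("|")
    print(f"{job_id:>12} {name:<12} {partition:<10} {cpus:>4} CPUs {elapsed} {state}")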

Plugins and Extensibility

SLURM offers a plugin architecture enabling custom schedulers, job completion actions, and resource selection plugins; this extensibility is comparable to the modular approaches of projects like Apache Hadoop and OpenStack. Plugins are leveraged by vendors and research groups, including Cray Inc., Hewlett Packard Enterprise, and academic groups at the University of Illinois Urbana–Champaign, to implement features such as energy-aware scheduling, checkpoint/restart coordination with libraries used in Argonne National Laboratory studies, and integration with container runtimes from Docker and Singularity (software). The ecosystem includes third-party tooling from companies like Bright Computing and community projects hosted on GitHub.
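While plugins themselves are typically written in C against SLURM's plugin API, a hedged way to see which plugin implementations a cluster has loaded is to read the running configuration, since parameters such as SchedulerType and SelectType name the active plugins:

import subprocess

# scontrol show config dumps the running configuration as "Key = Value" lines;
# entries like SchedulerType, SelectType, and JobCompType name loaded plugins.
result = subprocess.run(["scontrol", "show", "config"], capture_output=True, text=True)

for line in result.stdout.splitlines():
    if "=" not in line:
        continue
    key, _, value = line.partition("=")
    key = key.strip()
    # Keep only parameters that select a plugin implementation.
    if key.endswith("Type") or "Plugin" in key:
        print(f"{key} = {value.strip()}")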

Deployment and Administration

Administrators deploy SLURM on clusters composed of hardware from Supermicro, Dell Technologies, and HPE Cray EX systems, as well as blade solutions similar to those used at Fermilab and CERN. Typical practices mirror site operations at Brookhaven National Laboratory and RIKEN, including high-availability configurations, virtualization strategies such as Proxmox examined by teams at ETH Zurich, and monitoring integrations with systems like Prometheus and Nagios. Documentation and operational guidance are produced by SchedMD and by academic groups at institutions such as the University of Edinburgh and Rensselaer Polytechnic Institute.
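As a minimal monitoring sketch, assuming node-state counts from sinfo are sufficient, the snippet below emits Prometheus text-exposition lines; the metric name slurm_nodes_total is a hypothetical choice, not part of any official exporter.

import subprocess
from collections import Counter

# Count nodes per state with sinfo; -h drops the header, -N prints one node per row.
# %N is the node name and %T its state.
result = subprocess.run(
    ["sinfo", "-h", "-N", "-o", "%N %T"], capture_output=True, text=True
)

states = Counter(line.split()[1] for line in result.stdout.splitlines() if line.strip())

# Emit Prometheus text-exposition lines; a node-exporter textfile collector
# (hypothetical deployment choice) could scrape this output.
for state, count in sorted(states.items()):
    print(f'slurm_nodes_total{{state="{state}"}} {count}')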

Performance, Scalability, and Use Cases

SLURM is engineered for extreme-scale environments and runs on top-tier supercomputers including Frontera (supercomputer) and systems at NERSC such as Perlmutter (supercomputer). Performance tuning involves node partitioning, cgroup integration, and network optimizations practiced at Oak Ridge National Laboratory and Argonne National Laboratory; scalability studies reference deployments coordinating tens of thousands of compute nodes, similar to experiments run at Lawrence Livermore National Laboratory. Use cases span computational fluid dynamics workflows at NASA Ames Research Center, genomics pipelines at the Broad Institute, machine learning training at Facebook AI Research, and climate modeling collaborations with NOAA and the European Centre for Medium-Range Weather Forecasts.
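One hedged way to quantify scheduling behavior at scale is to estimate queue wait times from accounting data; the sketch below assumes slurmdbd is configured and uses sacct's Submit and Start timestamps.

import subprocess
from datetime import datetime

# Pull submit and start timestamps from accounting; -X keeps only allocation rows.
result = subprocess.run(
    ["sacct", "-X", "--noheader", "--parsable2", "--format=JobID,Submit,Start"],
    capture_output=True, text=True,
)

fmt = "%Y-%m-%dT%H:%M:%S"  # sacct's default timestamp format
waits = []
for line in result.stdout.splitlines():
    job_id, submit, start = line.split("|")
    if "T" not in submit or "T" not in start:  # skip Unknown/None entries
        continue
    waits.append(
        (datetime.strptime(start, fmt) - datetime.strptime(submit, fmt)).total_seconds()
    )

if waits:
    print(f"jobs: {len(waits)}, mean queue wait: {sum(waits) / len(waits):.0f}s")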

Category:Job scheduling software