| Slurm (software) | |
|---|---|
| Name | Slurm |
| Developer | Lawrence Livermore National Laboratory, SchedMD |
| Released | 2003 |
| Programming language | C |
| Operating system | Linux |
| License | GNU General Public License |
Slurm is an open-source, fault-tolerant, and highly scalable job scheduler and workload manager for compute clusters large and small. It orchestrates batch, array, and interactive workloads across nodes from vendors such as Cray, Hewlett Packard Enterprise, and Dell Technologies, and is widely used at national laboratories including Lawrence Livermore National Laboratory, Oak Ridge National Laboratory, and Argonne National Laboratory. Slurm integrates with filesystem technologies and telemetry systems such as Lustre, GPFS, and Prometheus.
Slurm originated at Lawrence Livermore National Laboratory and evolved through collaborations with organizations including Sandia National Laboratories and the University of California, Berkeley. It competes with workload managers such as PBS Professional, HTCondor, LSF, and Grid Engine while emphasizing modularity and plugins. Slurm provides centralized accounting, flexible partitioning, advance reservations, and energy-aware scheduling, and is used by supercomputers at facilities such as the Oak Ridge Leadership Computing Facility and in projects funded by the U.S. Department of Energy.
Slurm’s design centers on a set of cooperating daemons: the controller daemon (slurmctld), the compute node daemon (slurmd), and the database daemon (slurmdbd). The controller coordinates work across nodes, integrates with vendor hardware management tools, and serves job submission clients. Plugins implement scheduling, job completion, prolog/epilog handling, and task communication; this plugin model also eases integration with Ansible, Puppet, and the cluster provisioning systems employed at facilities such as the National Energy Research Scientific Computing Center.
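Plugin selection is expressed in slurm.conf; a minimal illustrative fragment (hostname is a placeholder, and a real deployment needs many more settings) might look like:

```
# slurm.conf (fragment) -- plugin selection; "head01" is a placeholder hostname
SlurmctldHost=head01                                 # node running slurmctld
AccountingStorageType=accounting_storage/slurmdbd    # route accounting through slurmdbd
SchedulerType=sched/backfill                         # backfill scheduling plugin
PriorityType=priority/multifactor                    # multifactor priority plugin
ProctrackType=proctrack/cgroup                       # track processes via cgroups
TaskPlugin=task/cgroup,task/affinity                 # task containment and CPU binding
```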
Slurm schedules jobs according to policies implemented in scheduler plugins that support backfill, fair-share, multifactor priority, and preemption. It manages advanced features including job arrays, reservations, heterogeneous resource allocation, and topology-aware placement that leverages information from InfiniBand fabrics and node topology databases common in systems built on NVIDIA and AMD hardware. Accounting and fair-share data are recorded to relational backends such as MariaDB or PostgreSQL via slurmdbd, for reporting in portals similar to those used by XSEDE and university computing centers.
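The core idea behind multifactor priority is a weighted sum of normalized job factors. The sketch below illustrates that idea only; the factor names, weights, and values are invented for the example and are not Slurm's defaults or its actual implementation:

```python
# Toy sketch of a multifactor-style priority calculation.
# Slurm's priority/multifactor plugin normalizes each factor to [0, 1]
# and combines it with site-configured integer weights; the numbers
# below are illustrative placeholders.

def multifactor_priority(factors, weights):
    """Combine normalized factors (each in 0.0-1.0) with integer weights."""
    return sum(int(weights[name] * value) for name, value in factors.items())

# Hypothetical site weights and per-job factor values.
weights = {"age": 1000, "fairshare": 10000, "partition": 1000, "qos": 2000}
job_factors = {"age": 0.5, "fairshare": 0.25, "partition": 1.0, "qos": 0.0}

print(multifactor_priority(job_factors, weights))  # 500 + 2500 + 1000 + 0 = 4000
```

A scheduler using such a score would dispatch pending jobs in descending priority order, with backfill filling gaps around the highest-priority jobs.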
Slurm is typically installed from source or vendor packages and configured through slurm.conf, cgroup integration, and node definitions. Administrators automate deployments with configuration management tools such as Salt, Chef, and Ansible to provision CPU, GPU, and accelerator resources across clusters operated by consortia such as PRACE and EuroHPC. Integration with telemetry and low-level resource control requires tuned kernel parameters on Linux distributions from vendors such as Red Hat and SUSE.
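Node and partition definitions also live in slurm.conf; a sketch with invented node names and sizes (a GPU entry like this would additionally require a matching gres.conf):

```
# slurm.conf (fragment) -- node and partition definitions; names and sizes illustrative
NodeName=cn[001-064] CPUs=64 RealMemory=256000 Gres=gpu:4 State=UNKNOWN
PartitionName=batch Nodes=cn[001-064] Default=YES MaxTime=24:00:00 State=UP
```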
Slurm supports authentication mechanisms and role-based controls that interoperate with identity infrastructures including Kerberos, LDAP, and the site-specific single sign-on systems used at institutions such as CERN and Brookhaven National Laboratory. It provides privilege separation between the controller and compute daemons and enforces resource limits with cgroups and namespaces, aligning with site policies and compliance programs at agencies such as the National Science Foundation and the U.S. Department of Energy.
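Cgroup-based enforcement is configured in cgroup.conf; a minimal fragment enabling the common constraints might look like:

```
# cgroup.conf (fragment) -- cgroup-based resource enforcement
ConstrainCores=yes       # pin tasks to their allocated cores
ConstrainRAMSpace=yes    # enforce the job's memory limit
ConstrainDevices=yes     # restrict device (e.g. GPU) access to the allocation
```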
Slurm has been demonstrated on exascale-class deployments and on machines ranked highly on the TOP500 list. Performance tuning involves scheduler plugin selection, database placement for accounting, and network fabric considerations for low-latency control-plane traffic on interconnects such as Mellanox InfiniBand. Scalability experiments and regressions are tracked in collaboration with centers such as NERSC, and community-contributed patches improve throughput on clusters with thousands of nodes.
Slurm development is driven by a mix of national laboratories, commercial contributors such as SchedMD, and academic partners including University of California campuses. The project maintains a roadmap, a plugin API, and a contribution process used by contributors from LLNL, Sandia National Laboratories, and commercial HPC integrators. Community resources include mailing lists, workshops at conferences such as SC and PEARC, and collaboration with open-source ecosystems such as OpenHPC.
Category:Free and open-source software
Category:Batch queuing systems
Category:High-performance computing