| SLURM | |
|---|---|
| Name | SLURM |
| Developer | SchedMD (current maintainer); originally Lawrence Livermore National Laboratory, with contributors including Red Hat, Cray Inc., and IBM |
| Initial release | 2003 |
| Operating system | Linux, FreeBSD |
| Platform | x86-64, ARM, POWER |
| Genre | Job scheduler, resource manager |
| License | GNU General Public License |
SLURM (originally the Simple Linux Utility for Resource Management, now the Slurm Workload Manager) is an open-source job scheduler and resource manager widely used on high-performance computing clusters, supercomputers, and research grids. It allocates compute nodes, schedules batch and interactive workloads, and integrates with system monitoring and accounting subsystems. SLURM is deployed at national laboratories, universities, and commercial computing centers to support scientific computing projects and large-scale simulations.
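A minimal batch script illustrates the batch workflow; the partition name `debug` and the script contents are placeholders rather than site recommendations:

```bash
#!/bin/bash
#SBATCH --job-name=hello          # name shown by squeue
#SBATCH --partition=debug         # partition names are site-specific (assumed here)
#SBATCH --nodes=1                 # number of compute nodes
#SBATCH --ntasks=4                # total tasks to launch
#SBATCH --time=00:10:00           # wall-clock limit (HH:MM:SS)
#SBATCH --output=hello_%j.out     # output file; %j expands to the job ID

srun hostname                     # launch one copy of hostname per task
```

The script is submitted with `sbatch hello.sh`, monitored with `squeue -u $USER`, and cancelled if necessary with `scancel` followed by the job ID.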
SLURM coordinates resources and jobs across clusters at sites such as Oak Ridge National Laboratory, Argonne National Laboratory, Lawrence Berkeley National Laboratory, Los Alamos National Laboratory, and the National Energy Research Scientific Computing Center (NERSC). It interfaces with software ecosystems including MPI, OpenMP, CUDA, Singularity, and Kubernetes-based workflows. Administrators commonly pair SLURM with provisioning tools like xCAT, Ansible, Puppet, and Chef, and with monitoring stacks such as Prometheus, Grafana, Nagios, and Ganglia. SLURM integrates with identity providers like LDAP and Kerberos for authentication, and with storage systems including Lustre, GPFS (IBM Spectrum Scale), Ceph, and NFS.
Core components include the slurmctld controller, slurmd daemons on compute nodes, and client utilities. The architecture supports modular plugins for scheduling, authentication, accounting, and job submission; common plugins originate from projects and organizations such as SchedMD, Cray Inc., Hewlett Packard Enterprise, Dell EMC, and IBM. SLURM runs on hardware platforms such as Intel Xeon, AMD EPYC, NVIDIA GPU, and Arm-based servers, and interacts with baseboard management controllers (BMCs) through protocols including IPMI and Redfish. Integrations extend to batch systems and workload managers like HTCondor, LSF, Torque, and PBS Professional for federated job submission.
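The client utilities give a command-line view of this architecture; the node name below is illustrative:

```bash
sinfo                                        # partitions and node states reported by slurmctld
squeue -u $USER                              # pending and running jobs for the current user
scontrol show node node001                   # detailed state of a single slurmd node
scontrol show config | grep -i scheduler     # scheduler-related settings loaded by the controller
```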
Scheduling policies in SLURM include backfill scheduling, fair-share allocation, and multifactor priority, which administrators tune for institutional policies or funding constraints from agencies such as DOE, NSF, and NASA. SLURM schedules MPI jobs using resource specifications compatible with libraries like Open MPI, MPICH, and Intel MPI, and with accelerators managed by CUDA and ROCm. Resource descriptors map to node features and partitions tied to procurement projects from vendors like Cray, HPE, Dell Technologies, and Lenovo. SLURM supports reservations, job arrays, job dependencies, and advanced placement features useful to research groups working with initiatives like Blue Waters, Summit, Fugaku, and Frontier.
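Job arrays, dependencies, and reservations are expressed directly at submission time; the job ID, reservation name, and script names below are placeholders:

```bash
# Submit a 100-element job array, limited to 10 elements running at once
sbatch --array=0-99%10 process_chunk.sh

# Run post-processing only after job 4242 completes successfully
sbatch --dependency=afterok:4242 postprocess.sh

# Place a job inside an existing administrator-created reservation
sbatch --reservation=maintenance_test debug_run.sh
```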
Administrators configure SLURM with slurm.conf, cgroup settings, and plugin modules to meet site policies at institutions such as CERN, the Rutherford Appleton Laboratory, the Max Planck Society, and the Tokyo Institute of Technology. Day-to-day administration is commonly automated with scripts and APIs in languages such as Python, Perl, Go, and Java to handle provisioning, integration, and lifecycle management. Change management often references practices from ITIL and security guidelines such as those from NIST and the CIS Benchmarks. Backup and high-availability configurations involve technologies like Corosync and Pacemaker, together with the MySQL or MariaDB databases used by slurmdbd.
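A minimal slurm.conf sketch shows how these pieces fit together; the host names, node definitions, and plugin choices are illustrative assumptions, not recommendations:

```
# slurm.conf (abridged)
ClusterName=examplecluster
SlurmctldHost=head01
SlurmctldHost=head02                     # optional backup controller for high availability
AuthType=auth/munge                      # MUNGE-based authentication
SchedulerType=sched/backfill             # backfill scheduler plugin
PriorityType=priority/multifactor        # multifactor job priority
SelectType=select/cons_tres              # CPUs, memory, and GPUs as consumable resources
ProctrackType=proctrack/cgroup           # cgroup-based process tracking
TaskPlugin=task/cgroup,task/affinity     # cgroup confinement plus CPU affinity
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=head01

NodeName=node[001-016] CPUs=64 RealMemory=256000 State=UNKNOWN
PartitionName=batch Nodes=node[001-016] Default=YES MaxTime=24:00:00 State=UP
```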
SLURM scales to tens of thousands of nodes and hundreds of thousands of cores in deployments at facilities including the Oak Ridge Leadership Computing Facility and the Argonne Leadership Computing Facility. Performance tuning addresses latency-sensitive workloads from fields like climate modeling, computational chemistry, and cosmology that use packages such as GROMACS, NAMD, LAMMPS, WRF, and Enzo. Network considerations involve interconnects such as InfiniBand, Intel Omni-Path, and Ethernet, with optimizations leveraging RDMA, topology-aware allocation, and NUMA affinity on large systems such as Summit and other CORAL-class machines. Benchmarking and profiling integrate with suites and tools like HPL, HPCC, Intel VTune, TAU, and perf.
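Topology-aware allocation is typically described to the scheduler in topology.conf (used with the topology/tree plugin); the switch and node names below are assumptions:

```
# topology.conf (illustrative two-level tree)
SwitchName=leaf1 Nodes=node[001-008]
SwitchName=leaf2 Nodes=node[009-016]
SwitchName=spine Switches=leaf[1-2]
```

Per-job NUMA affinity can then be requested at launch time with srun binding options such as `--cpu-bind=cores`.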
Authentication and authorization integrate with Kerberos, LDAP, and federated identity frameworks operated by organizations such as ESnet and Internet2. SLURM supports job-level isolation using cgroups and namespaces, together with container runtimes including Singularity and Docker where policies allow. Accounting uses slurmdbd and can export records to systems like XDMoD, the ELK Stack (Elasticsearch, Logstash, Kibana), and institutional billing systems used by centers funded by the DOE Office of Science and the NSF Directorate for Computer and Information Science and Engineering. Compliance and auditing align with standards from the NIST SP 800 publications and organizational policies at Department of Energy facilities.
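Accounting data collected by slurmdbd is managed and queried with the standard tools; the account and user names and the date range below are placeholders:

```bash
# Create an account and associate a user with it
sacctmgr add account climate_modeling Description="Climate simulation runs"
sacctmgr add user alice Account=climate_modeling

# Summarize usage for a reporting period before exporting it elsewhere
sreport cluster AccountUtilizationByUser Start=2024-01-01 End=2024-02-01
sacct -a --starttime=2024-01-01 --format=JobID,User,Account,Elapsed,CPUTimeRAW
```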
Development began at Lawrence Livermore National Laboratory in the early 2000s, with contributions from the community and organizations such as SchedMD, which commercialized support and services. Over time SLURM evolved through releases addressing scalability, plugin extensibility, and integration with emerging technologies from vendors such as NVIDIA, Intel, AMD, HPE, and Cray. The project has been discussed at conferences and workshops including the SC Conference, ISC High Performance, and PEARC, and in publications associated with the ACM and IEEE communities. Commercial support and ecosystem growth include partnerships with Canonical, Red Hat, and SUSE, as well as cloud providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform that offer HPC services.
Category:Job scheduling software