| Slurm Workload Manager | |
|---|---|
| Name | Slurm Workload Manager |
| Developer | SchedMD |
| Released | 2003 |
| Operating system | Linux |
| License | GNU General Public License |
Slurm Workload Manager
Slurm Workload Manager, originally the Simple Linux Utility for Resource Management (SLURM), is an open-source cluster management and job scheduling system for Linux clusters in high-performance computing environments. Development began around 2002 at Lawrence Livermore National Laboratory, and the project is now maintained and commercially supported by SchedMD. Slurm performs three core functions: it allocates exclusive or shared access to compute nodes for a period of time, provides a framework for starting, executing, and monitoring work (typically parallel jobs) on those nodes, and arbitrates contention by managing a queue of pending work. It is widely used at national laboratories, universities, and supercomputing centers, and it runs on a large share of the most powerful systems on the TOP500 list, including Frontier at Oak Ridge National Laboratory and Perlmutter at the National Energy Research Scientific Computing Center.
Slurm provides workload scheduling, resource management, and job orchestration for clusters ranging from small departmental systems to the largest national-laboratory machines, and it is used across government, academic, and industrial research computing. Alternative batch systems include TORQUE, PBS Professional, IBM Spectrum LSF, and Grid Engine. Slurm integrates with the surrounding HPC software stack: its launcher starts MPI programs, per-task CPU binding supports OpenMP threading, and generic resource (GRES) scheduling manages GPUs and other accelerators for CUDA and OpenCL workloads. The major cloud providers also offer documented Slurm deployments for elastic clusters.
The architecture separates control from execution using a small set of daemons and command-line utilities. A central controller daemon, slurmctld, tracks node and job state and makes scheduling decisions, with an optional backup controller for failover. A slurmd daemon on every compute node accepts work from the controller, launches and supervises tasks, and reports status. An optional slurmdbd daemon stores accounting records in a MySQL or MariaDB database. Client utilities such as sbatch, srun, salloc, squeue, sinfo, scancel, and scontrol submit, launch, inspect, and manage jobs. A plugin interface covers scheduling policy, node selection, interconnects, MPI launch methods, accounting storage, and authentication, allowing the same core to manage CPUs, memory, GPUs, and other generic resources across hardware from different vendors.
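The split between the controller and the client utilities is visible from any login node: commands such as sinfo and squeue are thin clients that query slurmctld over RPC and need no local daemon. A minimal sketch in Python (the command names and format specifiers are standard Slurm options; the parsing and printed summary are illustrative only):

```python
#!/usr/bin/env python3
"""Query cluster and queue state through Slurm's client utilities.

sinfo and squeue send RPCs to slurmctld, the central controller,
which holds the authoritative node and job state.
"""
import subprocess

def run(cmd):
    # Run a Slurm client command and return its stdout as a list of lines.
    out = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return out.stdout.splitlines()

# Partition name, availability, node count, and compact node state, no header.
for line in run(["sinfo", "--noheader", "--format=%P %a %D %t"]):
    partition, avail, nodes, state = line.split()
    print(f"partition {partition}: {nodes} nodes {state} ({avail})")

# Job id, owner, and state for everything currently in the queue.
for line in run(["squeue", "--noheader", "--format=%i %u %T"]):
    jobid, user, state = line.split()
    print(f"job {jobid} owned by {user} is {state}")
```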
Slurm is packaged for the major Linux distributions as RPM and Debian packages and is also commonly built from source; the daemons are normally managed as systemd service units. The central configuration file, slurm.conf, defines nodes, partitions, scheduling parameters, and plugin selections and must be consistent between the controller and the compute nodes; accounting database settings live in slurmdbd.conf and cgroup-based resource containment in cgroup.conf. Cluster operators layer site policy on top of the base installation: account hierarchies, access controls on partitions, and credential management such as the shared MUNGE key required by the default authentication plugin. Cloud deployments are supported through tooling such as AWS ParallelCluster, Azure CycleCloud, and the Slurm images maintained for Google Cloud Platform.
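A minimal slurm.conf names the controller host, enumerates the compute nodes, and defines at least one partition, and the same file is usually distributed verbatim to every node. The sketch below renders such a file; the cluster name, hostnames, hardware counts, and time limit are hypothetical placeholders rather than defaults, while the parameter keys themselves are standard slurm.conf options:

```python
#!/usr/bin/env python3
"""Render a minimal slurm.conf for a small hypothetical cluster.

Hostnames (ctl, node[01-04]) and hardware sizes are placeholders; real
sites tune many more parameters (scheduling weights, cgroups, accounting).
"""
conf = """\
ClusterName=demo
SlurmctldHost=ctl
AuthType=auth/munge
ProctrackType=proctrack/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_tres
AccountingStorageType=accounting_storage/slurmdbd

# Compute nodes and a default partition spanning all of them.
NodeName=node[01-04] CPUs=32 RealMemory=128000 State=UNKNOWN
PartitionName=batch Nodes=node[01-04] Default=YES MaxTime=24:00:00 State=UP
"""

with open("slurm.conf", "w") as f:
    f.write(conf)
print(conf)
```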
Users submit batch jobs with sbatch, launch parallel or interactive tasks with srun, and obtain allocations for interactive sessions with salloc; #SBATCH directives at the top of a batch script declare resource requirements such as node and task counts, memory, wall time, and the target partition (see the sketch below). Scheduling policy is plugin-driven: the multifactor priority plugin weighs job age, fair-share usage, job size, partition, and quality of service, and the backfill scheduler starts lower-priority jobs in gaps that will not delay the expected start of higher-priority work; preemption and gang scheduling are also available. Workflow managers such as Nextflow and Snakemake, widely used in bioinformatics pipelines, can dispatch their tasks through Slurm.
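A batch script is any executable script whose leading comment lines carry #SBATCH directives; sbatch records the directives and queues the script, and srun inside the allocation launches the parallel tasks. A minimal sketch, assuming a partition named batch and toy resource sizes (both placeholders); the directives and environment variables are standard Slurm names:

```python
#!/usr/bin/env python3
#SBATCH --job-name=demo
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00
#SBATCH --mem-per-cpu=2G
#SBATCH --output=%x-%j.out
"""Toy batch job: report the allocation, then launch one task per slot with srun."""
import os
import subprocess

# Environment variables set by Slurm inside the allocation.
print("job", os.environ.get("SLURM_JOB_ID"),
      "on nodes", os.environ.get("SLURM_JOB_NODELIST"))

# srun starts one copy of the command per allocated task (2 nodes x 4 tasks here).
subprocess.run(["srun", "hostname"], check=True)
```

The script would be submitted with `sbatch demo_job.py`; `%x` and `%j` in the --output pattern expand to the job name and job ID, and `srun` alone (or `salloc` followed by `srun`) covers the interactive case.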
Resource management and accounting features track allocations and usage for reporting and chargeback. The slurmdbd daemon records jobs, steps, and associations in a MySQL or MariaDB database; sacct reports per-job usage, sreport aggregates utilization by cluster, account, or user, and sacctmgr maintains the hierarchy of clusters, accounts, and users together with fair-share weights, resource limits, and quality-of-service definitions. These records underpin quota enforcement and allocation reporting at centers that charge usage against funded allocations; an example query follows.
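Per-job records can be pulled from the accounting database with sacct in a machine-readable form. The sketch below totals core-hours per user for jobs completed in the last week; the field names and options are standard sacct usage, while the date range and the core-hour summary are arbitrary choices for illustration:

```python
#!/usr/bin/env python3
"""Summarize recent completed-job usage from Slurm accounting via sacct."""
import subprocess

fields = "JobID,User,Partition,AllocCPUS,ElapsedRaw,State"
out = subprocess.run(
    ["sacct", "--allusers", "--starttime=now-7days", "--state=COMPLETED",
     "--parsable2", "--noheader", f"--format={fields}"],
    check=True, capture_output=True, text=True)

core_seconds = {}
for line in out.stdout.splitlines():
    jobid, user, partition, cpus, elapsed, state = line.split("|")
    if "." in jobid:
        continue  # skip job steps so each whole job is counted once
    core_seconds[user] = core_seconds.get(user, 0) + int(cpus) * int(elapsed)

# Report the heaviest users first, in core-hours.
for user, secs in sorted(core_seconds.items(), key=lambda kv: -kv[1]):
    print(f"{user:15s} {secs / 3600:10.1f} core-hours")
```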
Authentication between Slurm daemons and clients is handled by a pluggable credential service, most commonly MUNGE, which requires a key shared by every node and reasonably synchronized clocks. User identity comes from the host operating system, so sites typically integrate nodes with directory services such as LDAP, Active Directory, or Kerberos to keep UIDs and group membership consistent across the cluster. Secure deployments also restrict access to configuration and state files, limit who may submit to privileged partitions, and retain accounting and job-completion logs for auditing; sites subject to frameworks such as NIST guidance or the Federal Information Security Management Act add their own compliance controls on top. A quick key-consistency check is sketched below.
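One common MUNGE diagnostic is to encode a credential on the login node and decode it on a compute node, which only succeeds when both nodes share the same key. A rough sketch, assuming MUNGE is the configured authentication plugin and that an interactive srun is permitted; `munge -n` and `unmunge` are the standard MUNGE test commands:

```python
#!/usr/bin/env python3
"""Check that a MUNGE credential created locally decodes on a compute node."""
import subprocess

# munge -n emits a credential signed with the local key; unmunge on another
# node can only validate it if that node holds the same shared key.
cred = subprocess.run(["munge", "-n"], check=True,
                      capture_output=True, text=True).stdout

result = subprocess.run(["srun", "--ntasks=1", "unmunge"],
                        input=cred, capture_output=True, text=True)
print("MUNGE key consistent" if result.returncode == 0
      else f"unmunge failed:\n{result.stderr}")
```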
Slurm scales to clusters with tens of thousands of nodes and very large job queues and is deployed on many of the largest systems operated by Department of Energy laboratories. Typical workloads include computational chemistry, climate and weather modeling, astrophysics, genomics, and large-scale machine learning training. Performance tuning covers scheduler and backfill parameters, topology-aware placement, and vendor guidance for CPU, GPU, and network optimization. Job arrays make it convenient to submit and manage large collections of related jobs, such as per-sample analyses in genomic sequencing or ensemble and parameter-sweep simulations; an array example follows.
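A job array lets one submission stand for many nearly identical jobs; each array task receives its own SLURM_ARRAY_TASK_ID, typically used to index into a list of inputs. A minimal sketch in which the samples.txt manifest, the logs/ directory, and the resource sizes are placeholders; the directives, the `%A`/`%a` filename patterns, and the environment variables are standard Slurm names:

```python
#!/usr/bin/env python3
#SBATCH --job-name=samples
#SBATCH --array=1-100%10
#SBATCH --cpus-per-task=4
#SBATCH --time=02:00:00
#SBATCH --output=logs/%x-%A_%a.out
"""Toy array job: each array task processes one line of a sample manifest."""
import os

# Slurm sets SLURM_ARRAY_TASK_ID to a different value (1..100) in each task;
# the %10 throttle in --array keeps at most 10 tasks running at once.
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])

# samples.txt is a placeholder manifest with one sample name per line.
with open("samples.txt") as f:
    sample = f.read().splitlines()[task_id - 1]

print(f"array task {task_id} processing sample {sample} "
      f"with {os.environ.get('SLURM_CPUS_PER_TASK')} CPUs")
```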
Category:Job scheduling systems