| PBS Pro | |
|---|---|
| Name | PBS Pro |
| Developer | Altair Engineering (originally developed at NASA Ames Research Center) |
| Released | 1991 (original PBS at NASA Ames) |
| Latest release | 2024 |
| Programming language | C; Python (hooks and tooling) |
| Operating system | Linux, UNIX, IBM AIX, Microsoft Windows (clients) |
| License | Dual: commercial (Altair) and open source (OpenPBS, since 2016) |
Portable Batch System Professional (PBS Pro) is a commercial workload manager and job scheduler for high-performance computing clusters and supercomputers. It orchestrates job submission, resource allocation, queue management, and job execution across distributed systems, integrating with resource managers, parallel libraries, and monitoring tools. PBS Pro supports batch, array, and interactive jobs for research centers, national laboratories, and enterprises.
PBS Pro is a cluster workload manager used to schedule compute tasks on systems ranging from departmental clusters to national-scale supercomputers, including NASA's Pleiades system at the NASA Advanced Supercomputing Division, where the PBS lineage originated. It launches parallel jobs built on MPI implementations such as Open MPI and Intel MPI, and runs alongside parallel filesystems such as Lustre, IBM Spectrum Scale (GPFS), and BeeGFS. PBS Pro can integrate with container runtimes including Singularity/Apptainer and Docker (via wrappers), and its job accounting data can feed reporting tools such as Open XDMoD.
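Job submission works by handing qsub a script whose `#PBS` comment lines carry the resource request. PBS scans for the directive prefix regardless of the interpreter named in the shebang, so a Python script can describe its own resources; a minimal sketch follows, in which the job name, queue defaults, and resource values are illustrative only:

```python
#!/usr/bin/env python3
# Minimal PBS job script sketch. The "#PBS" comment lines below are
# read by qsub as directives; all resource values are hypothetical.
#PBS -N demo_job
#PBS -l select=1:ncpus=4:mem=8gb
#PBS -l walltime=00:10:00
import os

# PBS exports these variables inside a running job; the fallbacks let
# the script also run outside PBS for testing.
job_id = os.environ.get("PBS_JOBID", "interactive")
workdir = os.environ.get("PBS_O_WORKDIR", os.getcwd())
os.chdir(workdir)  # PBS starts jobs in $HOME, not the submit directory
print(f"job {job_id} running in {workdir}")
```

Such a script would be submitted with `qsub demo_job.py` and monitored with `qstat`.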
PBS Pro traces its lineage to the original Portable Batch System, developed at NASA Ames Research Center beginning in 1991 under contract with MRJ Technology Solutions. The technology passed to Veridian, which released the first commercial PBS Pro, and Altair Engineering acquired it in 2003. Development milestones mirror broader shifts in HPC: distributed resource management in the 1990s, scalable job arrays in the 2000s, and cloud-integrated scheduling in the 2010s. In 2016, Altair released the PBS Pro core as open source, a codebase now maintained as OpenPBS.
PBS Pro follows a server/scheduler/execution-agent architecture: a central server daemon (pbs_server) accepts jobs from client commands such as qsub and qstat, a scheduler daemon applies site policy, and a machine-oriented mini-server (pbs_mom) on each compute node launches and monitors job processes. The scheduler supports policies such as fair-share, priority, preemption, and backfilling across multiple queues. PBS Pro provides job arrays, advance reservations, and resource limits (cores, memory, GPUs), including scheduling of accelerators such as NVIDIA and AMD GPUs. Authentication and authorization can integrate with identity services such as LDAP and Active Directory. The software exposes APIs and Python-based hooks that run at defined points in the job lifecycle, enabling site-specific admission control and interoperability with workflow managers such as HTCondor and Apache Airflow and with science gateways such as Galaxy (bioinformatics). Monitoring and telemetry can feed tools such as Prometheus and Grafana, and PBS Pro is deployed both on premises (including OpenStack private clouds) and on public clouds such as Amazon Web Services.
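PBS Pro's qstat can emit machine-readable output (`qstat -f -F json`), which is a common starting point for monitoring glue of the kind described above. A minimal sketch, assuming that output shape; the sample jobs, the `summarize` helper, and the attribute subset shown are invented for illustration:

```python
import json
from collections import Counter

# Hypothetical sample modeled on `qstat -f -F json` output; real output
# carries many more attributes per job.
SAMPLE = """
{
  "Jobs": {
    "1001.server": {"Job_Name": "sim_a", "job_state": "R",
                    "Resource_List": {"ncpus": 64}},
    "1002.server": {"Job_Name": "sim_b", "job_state": "Q",
                    "Resource_List": {"ncpus": 128}},
    "1003.server": {"Job_Name": "post",  "job_state": "R",
                    "Resource_List": {"ncpus": 8}}
  }
}
"""

def summarize(qstat_json: str) -> dict:
    """Count jobs per state and total cores requested by running jobs."""
    jobs = json.loads(qstat_json).get("Jobs", {})
    states = Counter(j["job_state"] for j in jobs.values())
    running_cores = sum(j["Resource_List"]["ncpus"]
                        for j in jobs.values() if j["job_state"] == "R")
    return {"states": dict(states), "running_cores": running_cores}

print(summarize(SAMPLE))
# → {'states': {'R': 2, 'Q': 1}, 'running_cores': 72}
```

In production the JSON would come from invoking qstat rather than a literal string; exporting such summaries is one way sites feed schedulers' queue state into Prometheus.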
PBS Pro is distributed commercially by Altair Engineering, which offers support plans, training, and professional services; since 2016 the core has also been available as open source through the OpenPBS project. Its commercial model contrasts with community-supported schedulers such as Slurm and TORQUE, while retaining heritage from the original open PBS codebase. Commercial support typically covers installation, performance tuning, integration with vendor hardware and software stacks (for example HPE Cray systems and NVIDIA GPU tooling), and assistance with compliance requirements in regulated environments.
PBS Pro targets low-latency job dispatch and high throughput for job arrays, with deployments managing systems with hundreds of thousands of cores. Performance tuning often involves adjusting kernel and daemon parameters on distributions such as Red Hat Enterprise Linux and SUSE Linux Enterprise Server, optimizing InfiniBand fabrics, and using topology-aware placement on large systems from vendors such as Hewlett Packard Enterprise and Dell Technologies. Comparisons with alternatives such as Slurm and IBM Spectrum LSF typically focus on job start latency, sustained throughput, and backfill efficiency.
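Backfill efficiency, one of the comparison dimensions above, can be illustrated with a toy single-pass scheduler. The simplified EASY-style policy, job names, and core counts below are all hypothetical, not PBS Pro's actual algorithm:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int
    runtime: int  # requested walltime, arbitrary units

def backfill_schedule(jobs, total_cores):
    """Toy EASY-style backfill pass at time 0: start queued jobs in
    order; when the head job cannot fit, give it a reservation and let
    later jobs start only if they both fit now and finish before that
    reservation. Returns the names of jobs started at time 0."""
    free = total_cores
    started, running_ends = [], []
    queue = list(jobs)
    # Start jobs strictly in queue order until one does not fit.
    while queue and queue[0].cores <= free:
        j = queue.pop(0)
        free -= j.cores
        started.append(j.name)
        running_ends.append((j.runtime, j.cores))
    if not queue:
        return started
    head = queue.pop(0)
    # Reservation: earliest time enough cores are free for the head
    # job, assuming running jobs use their full requested walltime.
    avail, reservation = free, 0
    for end, cores in sorted(running_ends):
        avail += cores
        reservation = end
        if avail >= head.cores:
            break
    # Backfill: later jobs may jump the head only if they fit in the
    # currently free cores and end no later than the reservation.
    for j in queue:
        if j.cores <= free and j.runtime <= reservation:
            free -= j.cores
            started.append(j.name)
    return started

jobs = [Job("A", 8, 10), Job("B", 12, 5), Job("C", 4, 3), Job("D", 2, 20)]
print(backfill_schedule(jobs, 16))
# → ['A', 'C']: B must wait for A's cores, C slips in ahead of B
# because it ends before B's reservation, D is too long to backfill.
```

Real schedulers repeat such passes continuously and account for node topology, priorities, and preemption, but the core trade-off (utilization versus delaying the head job) is the same.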
Adopters include government research centers (notably NASA's Advanced Supercomputing Division, where PBS originated), universities operating large compute clusters, and industries such as aerospace, manufacturing, and finance. Use cases span computational chemistry with packages such as Gaussian, climate modeling with the Weather Research and Forecasting (WRF) model, genomics pipelines using GATK, and machine learning workloads using TensorFlow and PyTorch. PBS Pro is also used in hybrid cloud-bursting scenarios with providers such as Google Cloud Platform, Microsoft Azure, and Amazon Web Services to extend on-premises capacity.
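Pipeline workloads of this kind often rely on job arrays: one submission fans out into subjobs distinguished by an index. A sketch, using PBS Pro's `-J` array directive and the `PBS_ARRAY_INDEX` variable it sets in each subjob; the sample file names are placeholders:

```python
#!/usr/bin/env python3
# Sketch of a PBS array job that shards a list of inputs across
# subjobs. "#PBS -J 0-3" requests subjobs indexed 0..3.
#PBS -N shard_demo
#PBS -J 0-3
#PBS -l select=1:ncpus=2
import os

# Placeholder inputs; a real pipeline would list actual data files.
SAMPLES = ["sample0.fastq", "sample1.fastq",
           "sample2.fastq", "sample3.fastq"]

# PBS sets PBS_ARRAY_INDEX in each subjob; default to 0 so the script
# also runs outside PBS for testing.
idx = int(os.environ.get("PBS_ARRAY_INDEX", "0"))
sample = SAMPLES[idx]
print(f"subjob {idx} processing {sample}")
```

A single `qsub shard_demo.py` then produces four subjobs, each processing one input, which the scheduler can place independently.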
Critics cite proprietary licensing and a higher total cost of ownership compared with open-source alternatives such as Slurm, HTCondor, and TORQUE, and note interoperability challenges when migrating to or from sites standardized on other schedulers such as IBM Spectrum LSF. Some users report that advanced features, such as hooks and complex scheduling policies, take significant effort to configure correctly. Concerns have been raised about vendor lock-in for environments built around vendor-specific stacks from HPE Cray, NVIDIA, and IBM; conversely, proponents highlight the value of enterprise support and the open-source OpenPBS core as a mitigation.
Category:Job scheduling systems