| PBS (software) | |
|---|---|
| Name | Portable Batch System (PBS) |
| Developer | Altair Engineering (originally developed for NASA by MRJ Technology Solutions, later Veridian) |
| Released | 1991 |
| Latest release | PBS Professional, versioned by year (e.g., 19.x–23.x) |
| Programming language | C, C++ |
| Operating system | Linux, UNIX, macOS (clients), Windows (clients) |
| Genre | Job scheduler, workload manager, cluster manager, high-performance computing |
| License | Open source (Community editions), proprietary (commercial editions) |
PBS (Portable Batch System) is a family of workload management and job scheduling systems designed for high-performance computing clusters, supercomputers, and distributed compute farms. It orchestrates batch job submission, resource allocation, queuing, and execution across heterogeneous nodes, aiming to maximize utilization and fairness for scientific, engineering, and enterprise workloads. The PBS lineage has been influential in academic research centers, national laboratories, and commercial HPC environments.
PBS implementations provide a centralized scheduler, queue management, and daemons that run on controller and compute nodes to manage the job lifecycle, data staging, and policy enforcement. Typical deployments integrate with cluster resource managers, parallel libraries, and storage systems at institutions such as Lawrence Livermore National Laboratory, Oak Ridge National Laboratory, Argonne National Laboratory, CERN, and Los Alamos National Laboratory. Administrators define queues, reservations, and access controls, while users submit batch scripts or interactive sessions via command-line tools and graphical portals developed by vendors and third-party projects. PBS variants interoperate with ecosystem tools such as Slurm Workload Manager adapters, TORQUE (itself a PBS-derived resource manager) compatibility layers, and monitoring suites built on Prometheus (software), Grafana, and commercial offerings.
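Batch submission is typically done with a shell script annotated with `#PBS` directives, which the `qsub` command reads before the script body executes. The sketch below generates such a script as a string; the directive syntax follows PBS Professional's `select` resource form, and the queue name `workq` and solver command are illustrative assumptions, not universal defaults:

```python
def make_pbs_script(name, ncpus, mem_gb, walltime, command, queue="workq"):
    """Build a minimal PBS Professional job script as a string.

    Directive syntax follows PBS Pro conventions; the default queue
    name 'workq' is illustrative, not universal.
    """
    return "\n".join([
        "#!/bin/bash",
        f"#PBS -N {name}",          # job name shown in qstat
        f"#PBS -q {queue}",         # destination queue
        f"#PBS -l select=1:ncpus={ncpus}:mem={mem_gb}gb",  # one chunk
        f"#PBS -l walltime={walltime}",  # wall-clock limit HH:MM:SS
        'cd "$PBS_O_WORKDIR"',      # run from the submission directory
        command,
        "",
    ])

# Hypothetical job: 8 cores, 16 GB, 2 hours.
script = make_pbs_script("cfd-run", 8, 16, "02:00:00", "./solver input.cfg")
print(script)
```

The generated text would be written to a file and submitted with `qsub cfd-run.pbs` on a system where PBS is installed.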
The PBS lineage began in the early 1990s at NASA Ames Research Center, where the Portable Batch System was developed under contract by MRJ Technology Solutions to manage compute-intensive workloads on clusters and parallel systems. Commercial stewardship subsequently passed through MRJ, Veridian, and Altair Engineering, alongside open-source forks and community editions such as OpenPBS and TORQUE (the latter maintained by Adaptive Computing). PBS evolved alongside contemporaries like Load Sharing Facility and Sun Grid Engine, responding to scaling demands from projects such as Human Genome Project computations and climate modeling for programs at NASA and the European Centre for Medium-Range Weather Forecasts. Governance and licensing shifted between permissive community releases and proprietary enterprise editions, reflecting broader changes in HPC procurement at Department of Energy facilities and university centers.
The architecture centers on a server/controller daemon that mediates job submission, a scheduler that enforces policy and backfilling, and node agents that launch and track processes on compute nodes. Core features include resource specification for CPUs, memory, GPUs, and licenses; job arrays for parameter sweeps used in workflows like those from National Institutes of Health funded projects; pre- and post-execution hooks for data staging with systems such as Globus; and advanced reservation capabilities for conferences, collaborations, and time-sensitive allocations at sites like European Organization for Nuclear Research. Integration points include authentication through Lightweight Directory Access Protocol directories, accounting exports compatible with XDMoD, and plugin architectures to support custom schedulers, fair-share algorithms, and energy-aware policies for centers using Green500 metrics. High-availability configurations use fencing and failover techniques common to large-scale deployments at research centers including Fermilab.
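The backfilling mentioned above can be illustrated with a toy model: a lower-priority job may jump the queue only if it fits in currently idle resources and is guaranteed to finish before the highest-priority blocked job's reserved start time. This is a simplified sketch of the general technique, not PBS's actual scheduler logic; all names and numbers are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    ncpus: int     # requested CPU cores
    walltime: int  # requested wall-clock seconds

def can_backfill(candidate, free_cpus, blocked_job_start, now):
    """A candidate may backfill if it fits in the idle CPUs and is
    guaranteed to finish before the blocked top job's start time."""
    fits = candidate.ncpus <= free_cpus
    finishes_in_time = now + candidate.walltime <= blocked_job_start
    return fits and finishes_in_time

# 16 idle cores; the blocked top-priority job has a reservation at t=3600.
small = Job("post-process", ncpus=4, walltime=1800)
long_ = Job("ensemble", ncpus=4, walltime=7200)
print(can_backfill(small, 16, 3600, now=0))  # True: fits and ends early
print(can_backfill(long_, 16, 3600, now=0))  # False: would delay top job
```

Because backfill decisions depend on requested walltimes, sites often see better utilization when users request accurate rather than padded limits.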
Researchers, engineers, and data scientists employ PBS systems to run simulations, batch analytics, and distributed rendering for projects associated with institutions such as Stanford University, Massachusetts Institute of Technology, University of California, Berkeley, and companies like Boeing and General Electric. Common application domains include computational fluid dynamics for NASA missions, molecular dynamics in collaboration with National Institutes of Health consortia, seismic imaging for energy companies, and machine learning model training in partnerships with technology firms such as NVIDIA. Workflow managers and scientific gateways—examples include those built on Apache Airflow or university cyberinfrastructure platforms—often submit tasks to PBS via REST APIs, connectors, or composer tools developed by cloud and HPC integrators.
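Workflow managers that drive PBS programmatically commonly shell out to `qsub` and capture the job identifier it prints, which typically has the form `1234.servername`. The sketch below shows that pattern under those assumptions; the server name in the example is hypothetical, and `submit()` requires a working PBS installation:

```python
import re
import subprocess

def parse_job_id(qsub_output):
    """Extract the numeric id from qsub output like '1234.pbsserver'."""
    m = re.match(r"(\d+)\.\S+", qsub_output.strip())
    if not m:
        raise ValueError(f"unexpected qsub output: {qsub_output!r}")
    return int(m.group(1))

def submit(script_path):
    """Submit a job script via qsub and return its numeric job id.
    Only works where PBS is installed; shown for illustration."""
    out = subprocess.run(["qsub", script_path],
                         capture_output=True, text=True, check=True).stdout
    return parse_job_id(out)

print(parse_job_id("4821.pbsserver\n"))  # 4821
```

Connectors for workflow engines usually wrap this pattern with polling of `qstat` to track job state, retrying on transient server errors.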
PBS implementations have been benchmarked and tuned for clusters ranging from tens to hundreds of thousands of cores. Scalability strategies include hierarchical scheduling, multi-threaded server processes, and efficient job array handling to reduce per-job overhead for ensembles and parameter sweeps, such as those used in climate modeling projects linked to Intergovernmental Panel on Climate Change assessments. Performance tuning typically involves optimizing communication over interconnects such as InfiniBand and RDMA-capable fabrics, reducing node launch latencies for MPI jobs built on libraries like Open MPI and MPICH, and using topology-aware allocation for NUMA and GPU-heavy workloads with devices from vendors such as NVIDIA and AMD. Empirical deployments show that scheduling policy and accounting integration matter as much as raw submission throughput for the site-level efficiency metrics reported by consortia such as TOP500.
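Job arrays reduce per-job overhead because a single `qsub -J start-end[:step]` submission covers many sub-jobs that share one script and one scheduling record. A small parser illustrating how that range spec expands into sub-job indices (the parser itself is a sketch, not part of PBS):

```python
def expand_array_range(spec):
    """Expand a PBS '-J start-end[:step]' range spec into the list of
    sub-job indices it denotes. Indices are assumed non-negative."""
    bounds, _, step = spec.partition(":")
    start, end = (int(x) for x in bounds.split("-"))
    return list(range(start, end + 1, int(step) if step else 1))

# e.g. 'qsub -J 0-9:2 sweep.pbs' creates sub-jobs [0], [2], [4], [6], [8]
print(expand_array_range("0-9:2"))       # [0, 2, 4, 6, 8]
print(len(expand_array_range("1-100")))  # 100
```

Each sub-job sees its own index in an environment variable at run time, so one script can select its slice of a parameter sweep.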
PBS distributions are offered under mixed licensing models: community editions released under open-source licenses and commercial enterprise editions with proprietary support, training, and integration services. This dual model has enabled wide adoption across academic consortia, national laboratories, and enterprises that require vendor-backed SLAs from providers like Altair Engineering and consulting firms specializing in HPC stack deployment. Adoption decisions often weigh support for compliance frameworks, export controls at facilities like Sandia National Laboratories, and compatibility with procurement at university computing centers such as those at Princeton University and University of Cambridge. The software’s presence in both community-driven projects and vendor stacks contributes to an ecosystem of tools, documentation, and third-party integrations used across the global HPC community.
Category:High-performance computing software