LLMpedia: the first transparent, open encyclopedia generated by LLMs

cluster (computing)

Generated by GPT-5-mini
Note: This article was automatically generated by a large language model (LLM) from purely parametric knowledge (no retrieval). It may contain inaccuracies or hallucinations. This encyclopedia is part of a research project currently under review.
Article Genealogy
Expansion funnel: 82 extracted → 0 after dedup → 0 after NER → 0 enqueued
Image: MEGWARE Computer GmbH · CC BY-SA 3.0
Name: Cluster (computing)
Type: Architecture
Introduced: 1980s
Developer: Various
Platform: Various


A computing cluster is a set of interconnected computers (nodes), ranging from commodity servers to supercomputer-class machines, that operates as a unified computational resource. Clusters combine processors, storage, networking, and software to provide higher throughput, availability, or cost-effectiveness than single systems; they have been deployed in environments ranging from Lawrence Berkeley National Laboratory and CERN to cloud datacenters run by Amazon Web Services, Microsoft Azure, and Google Cloud Platform. Modern clusters integrate innovations from projects at Los Alamos National Laboratory, Sandia National Laboratories, and commercial vendors such as Intel, NVIDIA, and Red Hat.

Overview

A cluster aggregates multiple computers to present a coherent service for compute, storage, or both; typical goals include improved performance, fault tolerance, and manageability. Clusters often use high-speed interconnects developed by companies like Mellanox Technologies and standards promoted by organizations such as the Open Compute Project and The Linux Foundation. Administrators manage clusters with orchestration frameworks that descend from decades of distributed-systems research and are operationalized today by projects like Kubernetes and OpenStack.

Architecture and Components

Cluster architectures consist of nodes, interconnects, storage subsystems, and management layers. Nodes may be homogeneous or heterogeneous, built on processors from AMD, ARM, or Intel and accelerators such as NVIDIA and AMD GPUs or Google TPUs. Interconnects include Ethernet, InfiniBand, and proprietary fabrics developed by Cray Research and modern vendors; storage may employ parallel file systems such as Lustre and GPFS (now IBM Spectrum Scale). Management components include boot services, configuration-management tools such as Puppet and Ansible, and monitoring stacks influenced by Prometheus and Nagios. Security and identity integration often reference standards from the IETF and implementations by Red Hat and Microsoft.
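As an illustrative sketch only (not any vendor's actual inventory format), the relationship between heterogeneous nodes and the cluster that aggregates them into one resource pool can be modeled as follows; the hostnames and sizes are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One cluster node; heterogeneous clusters mix GPU and CPU-only nodes."""
    hostname: str
    cpus: int
    memory_gb: int
    gpus: int = 0

@dataclass
class Cluster:
    """Aggregates many nodes into a single logical resource pool."""
    nodes: list = field(default_factory=list)

    def total_cpus(self) -> int:
        # The cluster presents the sum of its parts as one resource.
        return sum(n.cpus for n in self.nodes)

    def gpu_nodes(self) -> list:
        # Schedulers often partition accelerator nodes separately.
        return [n for n in self.nodes if n.gpus > 0]

cluster = Cluster([
    Node("cn001", cpus=64, memory_gb=256),
    Node("cn002", cpus=64, memory_gb=256),
    Node("gpu001", cpus=32, memory_gb=512, gpus=4),
])
print(cluster.total_cpus())                        # 160
print([n.hostname for n in cluster.gpu_nodes()])   # ['gpu001']
```

Real management layers (e.g., Slurm's node configuration or a Kubernetes node inventory) track far more state, but the same aggregation principle applies.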

Types of Clusters

Clusters are classified by purpose and architecture. High-performance computing (HPC) clusters, exemplified by systems at Oak Ridge National Laboratory and Argonne National Laboratory, prioritize low-latency interconnects and parallel file systems for scientific computing. High-availability clusters protect services in enterprises such as Goldman Sachs and Walmart by using failover mechanisms from vendors like Microsoft and F5 Networks. Load-balancing clusters support web-scale services run by Facebook and Twitter using software load balancers influenced by projects at Netflix. Storage clusters are implemented by companies like EMC Corporation and NetApp with distributed filesystems inspired by Google's published systems research, such as the Google File System. Specialized clusters include GPU clusters for machine learning used by OpenAI and DeepMind, and microservice clusters orchestrated with Kubernetes for platforms such as Spotify.
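The simplest policy a load-balancing cluster can use is round-robin distribution, in which requests cycle through the backend nodes in order. A minimal sketch (the backend names are hypothetical, and production balancers add health checks and weighting):

```python
import itertools

class RoundRobinBalancer:
    """Toy round-robin load balancer: requests cycle through backends."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self) -> str:
        # Each call hands the next request to the next backend in turn.
        return next(self._cycle)

lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
picks = [lb.pick() for _ in range(5)]
print(picks)  # ['node-a', 'node-b', 'node-c', 'node-a', 'node-b']
```

Round-robin assumes roughly uniform request cost; policies such as least-connections or weighted round-robin relax that assumption.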

Cluster Management and Software

Cluster management encompasses provisioning, orchestration, scheduling, and monitoring. Job schedulers like the Slurm Workload Manager and HTCondor allocate HPC workloads at facilities including Fermilab and CERN. Cloud-native clusters use orchestration from Kubernetes and service meshes such as Istio to manage containers at scale for companies like Dropbox and Airbnb. Configuration and lifecycle tools such as Chef and Terraform automate deployments across providers such as Amazon Web Services and Google Cloud Platform. Monitoring and telemetry adopt stacks involving Prometheus, Grafana, and log-aggregation tooling from Elastic.
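Batch schedulers such as Slurm queue submitted jobs and dispatch them when enough nodes are idle. A toy first-in-first-out sketch (not Slurm's actual algorithm, which also handles priorities, preemption, and backfill) illustrates the core bookkeeping:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes_needed: int

class FifoScheduler:
    """Toy FIFO scheduler: jobs run in submission order when nodes free up."""
    def __init__(self, total_nodes: int):
        self.idle = total_nodes
        self.queue = deque()
        self.running = []

    def submit(self, job: Job):
        self.queue.append(job)
        self._dispatch()

    def _dispatch(self):
        # Strict FIFO: only the job at the head of the queue may start.
        while self.queue and self.queue[0].nodes_needed <= self.idle:
            job = self.queue.popleft()
            self.idle -= job.nodes_needed
            self.running.append(job.name)

sched = FifoScheduler(total_nodes=8)
sched.submit(Job("sim", 6))     # starts immediately, 2 nodes left idle
sched.submit(Job("train", 4))   # waits: needs 4, only 2 idle
sched.submit(Job("small", 2))   # would fit, but is blocked behind "train"
print(sched.running)            # ['sim']
```

The last line shows head-of-line blocking: "small" could run on the 2 idle nodes, but strict FIFO holds it behind "train". Backfill scheduling, which Slurm supports, exists precisely to fill such gaps without delaying the head job.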

Performance, Scalability, and Reliability

Performance engineering for clusters draws on Amdahl's law, research at institutions such as the Massachusetts Institute of Technology, and industrial benchmarks produced by groups like SPEC. Scalability depends on network topology, algorithmic parallelism, and storage design; topologies range from fat-trees, introduced by Charles Leiserson at MIT, to torus and dragonfly networks commercialized by Cray Inc. Reliability techniques include redundancy, consensus algorithms like Paxos and Raft, and checkpoint/restart systems developed in collaboration between institutions such as Los Alamos National Laboratory and vendors like Hewlett Packard Enterprise. Energy-efficiency efforts reference programs at Lawrence Berkeley National Laboratory and partnerships with U.S. Department of Energy (DOE) initiatives.
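Amdahl's law bounds the speedup of a partially parallel workload: with serial fraction s, the speedup on n nodes is 1/(s + (1-s)/n). A short calculation shows why adding nodes has diminishing returns:

```python
def amdahl_speedup(parallel_fraction: float, n_nodes: int) -> float:
    """Theoretical speedup when `parallel_fraction` of the work
    can be spread across `n_nodes` (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_nodes)

# With 95% parallel work, even 1024 nodes yield under 20x speedup,
# because the 5% serial portion dominates at scale.
print(round(amdahl_speedup(0.95, 1024), 2))  # 19.64
```

The serial fraction caps the achievable speedup at 1/s (here 20x), which is why cluster performance work targets communication overhead and residual serial sections as aggressively as raw node count.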

Use Cases and Applications

Clusters underpin scientific computing at institutions such as NASA and the European Space Agency, enable big-data analytics at firms like Bloomberg L.P. and Spotify, and support financial risk modeling at Morgan Stanley and JPMorgan Chase. Machine learning training at organizations like OpenAI and DeepMind uses GPU clusters integrated with frameworks such as TensorFlow and PyTorch. Content delivery and streaming platforms operated by Netflix and YouTube rely on clusters for encoding, caching, and distribution. Enterprise resource planning systems and databases run clustered deployments from vendors including Oracle Corporation and Microsoft.

History and Evolution

Cluster computing evolved from early parallel machines and academic projects in the 1980s and 1990s, including work at Lawrence Livermore National Laboratory and research influenced by the ARPANET era. The move from specialized massively parallel processors to commodity-based clusters was accelerated by initiatives at the University of California, Berkeley and by researchers funded by DARPA and the NSF. Commercialization followed with companies such as Sun Microsystems, HP, and IBM supporting cluster products; later cloud providers including Amazon Web Services and Google transformed deployment models. Recent evolution emphasizes container orchestration pioneered by Google engineers, hardware acceleration from NVIDIA and Intel, and open hardware efforts like the Open Compute Project.

Category:Distributed computing