| STONITH | |
|---|---|
| Name | STONITH |
| Caption | Cluster fencing concept |
| Developer | Various open-source and commercial vendors |
| Released | 1990s–2000s |
| Operating system | Unix-like, Linux, Windows |
| License | Varies (open-source and proprietary) |
STONITH
STONITH ("Shoot The Other Node In The Head") is a cluster fencing technique used to forcibly isolate malfunctioning nodes in high-availability computing environments. It appears in discussions of cluster resource management, fault tolerance, and system administration wherever abrupt power-off or reset actions are needed to protect shared resources and data integrity. The term is most often encountered in distributed systems, storage clusters, and virtualization stacks.
STONITH is a defensive mechanism used on clustered platforms including Linux distributions such as Red Hat Enterprise Linux, Debian, Ubuntu, SUSE Linux Enterprise Server, CentOS, and Oracle Linux, as well as Microsoft Windows Server, VMware ESXi, Proxmox VE, and OpenStack. It is driven by cluster managers such as Pacemaker, Corosync, Heartbeat, and Keepalived, and by orchestration systems including Kubernetes, OpenShift, and Apache Mesos. Administrators use it alongside storage systems such as Red Hat Gluster Storage, Ceph, NetApp, EMC, and IBM Spectrum Scale to prevent split-brain scenarios and protect persistent volumes. STONITH is documented by vendors such as Canonical, SUSE, and Red Hat, and by community projects such as ClusterLabs.
The core purpose is to guarantee that a failed or unresponsive node can no longer access shared resources. Typical failure modes include hung kernels and network partitions, whether in cloud environments like Amazon Web Services, Google Cloud Platform, and Microsoft Azure or in on-premises clusters serving applications such as MySQL, PostgreSQL, MongoDB, Redis, Apache Cassandra, and Apache Kafka. Functionally, STONITH powers off, resets, or isolates the errant node through remote-controlled devices from manufacturers including APC by Schneider Electric, Dell EMC, Hewlett Packard Enterprise, IBM, and Supermicro. It complements quorum mechanisms, exemplified by etcd, ZooKeeper, and Corosync quorum, to preserve cluster consistency and prevent conflicting resource ownership.
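The quorum side of this interaction can be sketched as a simple majority-vote check; this is an illustrative fragment, not the logic of any particular project:

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """A partition may keep running resources (and fence others) only
    if it holds a strict majority of votes; at most one partition in
    any split can satisfy this, which rules out split-brain."""
    return votes_present > total_votes // 2

# In a five-node cluster split 3/2, only the three-node side retains
# quorum and is therefore allowed to fence the minority side.
assert has_quorum(3, 5)
assert not has_quorum(2, 5)
# An even 2/2 split leaves neither side with quorum, which is why
# two-node clusters typically need a tiebreaker such as a quorum device.
assert not has_quorum(2, 4)
```

The strict majority is what makes the check safe: two disjoint partitions can never both hold more than half of the total votes.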
Implementations range from soft fencing to hard fencing. Soft fencing relies on some cooperation from the target, using channels such as SSH (Secure Shell) or hypervisor APIs such as those in Xen Project, KVM, Microsoft Hyper-V, and VMware vSphere. Hard fencing acts out of band, using power control through IPMI- and Redfish-capable BMC interfaces, SNMP-managed devices, remote power distribution units from APC, and networked power switches. Other methods include fabric isolation using Fibre Channel zoning, SAN-level LUN masking offered by Brocade Communications Systems, Cisco Systems, and HPE 3PAR, or storage array commands from vendors like NetApp and Dell EMC Unity. Cluster resource agents implement these methods through modules maintained by ClusterLabs and through vendor-specific tooling.
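A common pattern is to try gentler methods first and escalate to out-of-band power control. The sketch below uses hypothetical agent callables and is not tied to any real fence agent:

```python
from typing import Callable, Sequence

# A fencing method is a (name, attempt) pair; attempt returns True
# only when the method has confirmed the node is isolated.
FenceMethod = tuple[str, Callable[[str], bool]]

def fence_node(node: str, methods: Sequence[FenceMethod]) -> str:
    """Attempt fencing methods in order (soft first, hard last) and
    return the name of the first method that confirms success."""
    for name, attempt in methods:
        if attempt(node):
            return name
    raise RuntimeError(f"all fencing methods failed for {node}")

# Hypothetical agents: an SSH-based shutdown fails because the node
# is hung, but the out-of-band IPMI power-off succeeds regardless.
methods: list[FenceMethod] = [
    ("ssh_shutdown", lambda node: False),
    ("ipmi_poweroff", lambda node: True),
]
assert fence_node("node2", methods) == "ipmi_poweroff"
```

Ordering soft methods before hard ones gives a clean shutdown a chance while guaranteeing that an uncooperative node is still cut off.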
Cluster stacks such as Pacemaker, Corosync, and Heartbeat, together with replication layers like DRBD, plug into fencing agents to make automated decisions. The cluster stack applies policies defined by administrators and by projects like Linux-HA to decide when to fence, drawing on watchdog integration such as systemd watchdog timers or IPMI sensor thresholds. Integration points include resource agents following the OCF (Open Cluster Framework) standard, fencing components from the fence-agents packages, and management consoles from Red Hat Cluster Suite and SUSE Manager. In cloud-managed clusters on AWS EC2, Google Compute Engine, or Azure Virtual Machines, fencing often maps to provider APIs that terminate or isolate instances.
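In cloud environments a fencing agent typically wraps the provider's instance-control API. The following sketch uses a stand-in client object; the method names are hypothetical, not those of any real SDK:

```python
class CloudFenceAgent:
    """Illustrative cloud fencing agent: 'shooting' a node maps to a
    forced stop of its instance via the provider API. The client is
    a hypothetical stand-in for a real cloud SDK."""

    def __init__(self, client):
        self.client = client

    def fence(self, instance_id: str) -> bool:
        self.client.stop_instance(instance_id, force=True)
        # Only report success once the provider confirms the instance
        # is down; resources must not be recovered before that point.
        return self.client.get_state(instance_id) == "stopped"

class FakeClient:
    """In-memory test double standing in for a cloud SDK client."""
    def __init__(self):
        self.states: dict[str, str] = {}

    def stop_instance(self, instance_id: str, force: bool) -> None:
        self.states[instance_id] = "stopped"

    def get_state(self, instance_id: str) -> str:
        return self.states.get(instance_id, "running")

agent = CloudFenceAgent(FakeClient())
assert agent.fence("i-0123456789") is True
```

Confirming the stop before returning mirrors what real fence agents do: the cluster treats fencing as complete only on positive confirmation.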
Hardware devices used include IPMI-capable BMCs, lights-out management cards such as iLO (Integrated Lights-Out), vendor power distribution units from APC, and serial console multiplexers. Software fence agents include fence_ipmilan, fence_virt, fence_xvm, fence_kubevirt, fence_gce, fence_aws, and vendor-specific agents maintained by Red Hat and SUSE. Storage-related fencing may leverage SCSI-3 persistent reservations, FCoE, and array management interfaces from NetApp ONTAP or EMC VNX for taking LUNs offline. Configuration management frameworks such as Ansible, Puppet, Chef, and SaltStack can orchestrate fencing actions in hybrid environments.
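SCSI-3 persistent-reservation fencing can be illustrated with a toy model of the register/preempt cycle; this is a simplified simulation of the mechanism, not a SCSI implementation:

```python
class SharedLUN:
    """Toy model of SCSI-3 persistent-reservation fencing: nodes
    register keys with the shared LUN, and fencing a node preempts
    its key so the LUN rejects that node's subsequent I/O."""

    def __init__(self):
        self.registered_keys: set[int] = set()

    def register(self, key: int) -> None:
        self.registered_keys.add(key)

    def preempt(self, victim_key: int) -> None:
        # Similar in spirit to SCSI PREEMPT: the victim's key is
        # removed, so the device no longer honors its commands.
        self.registered_keys.discard(victim_key)

    def write_allowed(self, key: int) -> bool:
        return key in self.registered_keys

lun = SharedLUN()
lun.register(0x1)  # node1 joins the cluster
lun.register(0x2)  # node2 joins the cluster
lun.preempt(0x2)   # node2 is declared failed and fenced
assert lun.write_allowed(0x1)
assert not lun.write_allowed(0x2)
```

The appeal of this style of fencing is that the storage itself enforces the isolation, so even a node that is alive but unreachable cannot corrupt shared data.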
Best practices recommend authoritative fencing policy configuration, testing in staging environments, and documented recovery workflows for operators, including at organizations like NASA, CERN, Facebook, Twitter, and Netflix that run large clusters. Administrators should use multiple, independent fencing methods where possible, configure timeouts conservatively, and prefer hard fencing in environments with shared block storage used by Oracle Database, Microsoft SQL Server, or clustered filesystems like GFS2, OCFS2, and CephFS. Implement role-based access control through LDAP or Active Directory for management consoles, and monitor fencing actions via observability stacks such as the ELK Stack or Prometheus to support audits under compliance frameworks like SOC 2.
Fencing devices expose powerful control planes, so secure configuration is critical. Use authenticated and encrypted protocols such as IPMI 2.0 with strong cipher suites and Redfish over TLS, and restrict management interfaces with IEEE 802.1X or VPN gateways. Maintain firmware and patching policies aligned with advisories from US-CERT, CISA, and vendor security notices from Dell, HPE, Lenovo, and Supermicro. Implement safeguards against accidental mass power actions, and apply change-control processes of the kind used in enterprises like Goldman Sachs or JPMorgan when planning fencing-related automation.
Fencing practices evolved alongside clustering research at institutions and projects such as Carnegie Mellon University, Lawrence Berkeley National Laboratory, Sun Microsystems, and early high-availability work embodied in Unix clustering. Standards and de facto interfaces emerged from specifications like IPMI, SCSI, and vendor-led APIs including Redfish. Open-source ecosystems consolidated fencing agents and policies through projects such as Linux-HA, ClusterLabs, and distributions maintained by Canonical, Red Hat, and SUSE.
Category:High availability