Chaos Monkey

Name	Chaos Monkey
Developer	Netflix
Released	2011
Operating system	Cross-platform
Genre	Resilience engineering tool
License	Apache License 2.0

Contents

Overview
History and development
Functionality and operation
Impact and adoption
Related tools and ecosystem

Chaos Monkey is a tool developed by Netflix as part of its Simian Army suite, designed to intentionally disrupt services in a cloud computing environment in order to test system resilience. By randomly terminating virtual machine instances and containers, it forces engineers to build services that can withstand such failures without impacting customers. The tool's philosophy is rooted in chaos engineering, which proactively seeks to uncover systemic weaknesses before they cause outages.

Overview

Chaos Monkey operates on the principle that failures in a distributed system are inevitable, especially in large-scale infrastructures such as those hosted on Amazon Web Services or Microsoft Azure. Its primary function is to automatically and randomly disable production instances so that engineers can confirm the overall architecture is fault-tolerant. This approach encourages the design of systems that follow best practices in high availability and disaster recovery, rather than relying only on traditional quality assurance methods. The tool has become a foundational means of checking whether services still meet their service-level agreements when real failures occur.

History and development

The tool was created internally by Netflix engineers, including Ariel Tseitlin and Cory Bennett, around 2011 as the company migrated its entire infrastructure to Amazon Web Services. This transition from a monolithic application running in a data center to a microservices architecture in the public cloud necessitated a new approach to reliability testing. The name refers to the idea of a "monkey" causing random havoc. The team open-sourced the software in 2012 under the Apache License 2.0. Its development was closely tied to the creation of other tools in the Simian Army, such as Latency Monkey and Chaos Gorilla, each targeting a different failure mode.

Functionality and operation

Chaos Monkey is typically configured to run during business hours and targets instances that are part of an auto-scaling group. It identifies healthy instances via application programming interface calls to cloud providers like Amazon Web Services or Google Cloud Platform and then terminates them. The tool integrates with Spinnaker, Netflix's continuous delivery platform, and can be managed through a web interface or command-line interface. Its operation is governed by configurable parameters, such as the probability of attack and exemption lists for critical services, ensuring controlled experimentation. This process validates the effectiveness of load balancing and failover mechanisms within the system.
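
The selection logic described above can be illustrated with a minimal Python sketch. It is not Chaos Monkey's actual code or configuration format: the `Instance` type, the function names, the 09:00 to 15:00 weekday window, and the rule of at most one victim per group are all assumptions made for illustration. The sketch only chooses candidates; it makes no cloud provider calls and terminates nothing.

```python
import random
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for one instance in an auto-scaling group."""
    instance_id: str
    group: str
    healthy: bool


def in_business_hours(now, start=time(9, 0), end=time(15, 0)):
    """True on weekdays between start (inclusive) and end (exclusive).

    The window itself is an assumed example, not a documented default.
    """
    return now.weekday() < 5 and start <= now.time() < end


def pick_victims(instances, now, probability, exempt_groups, rng):
    """Return at most one healthy instance per non-exempt group.

    Each eligible group is attacked independently with the given probability.
    """
    if not in_business_hours(now):
        return []

    # Group healthy instances, skipping anything on the exemption list.
    by_group = {}
    for inst in instances:
        if inst.healthy and inst.group not in exempt_groups:
            by_group.setdefault(inst.group, []).append(inst)

    victims = []
    for group in sorted(by_group):
        if rng.random() < probability:
            victims.append(rng.choice(by_group[group]))
    return victims


if __name__ == "__main__":
    fleet = [
        Instance("i-001", "api", True),
        Instance("i-002", "api", True),
        Instance("i-003", "billing", True),
        Instance("i-004", "search", False),
    ]
    chosen = pick_victims(
        fleet,
        now=datetime(2024, 1, 8, 10, 0),  # a Monday morning
        probability=0.5,
        exempt_groups={"billing"},
        rng=random.Random(),
    )
    print([inst.instance_id for inst in chosen])
```

Passing the random number generator in as a parameter keeps the selection reproducible in tests, and the exemption list and probability correspond to the configurable parameters mentioned in the text.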

Impact and adoption

The release and promotion of Chaos Monkey significantly influenced the DevOps and site reliability engineering communities, popularizing the discipline of chaos engineering. Major technology firms, including Amazon, Google, Microsoft, and LinkedIn, have developed similar internal tools or adopted the principles. It has been cited in numerous talks at conferences like AWS re:Invent and Velocity Conference as a best practice for building resilient cloud-native applications. The widespread adoption of its concepts is reflected in commercial offerings from companies like Gremlin Inc. and services within Azure Chaos Studio.

Related tools and ecosystem

Chaos Monkey is the best-known member of the broader Simian Army, which includes Latency Monkey for introducing network delay, Chaos Gorilla for simulating the failure of an entire availability zone, and Security Monkey for identifying policy violations. The success of these tools helped inspire the open-source platform Chaos Toolkit and commercial products such as Gremlin. Other significant projects in this domain include Litmus from MayaData for Kubernetes environments and Toxiproxy from Shopify for network failure simulation. Together these form a broad ecosystem for resilience testing.

Category:Cloud computing Category:Software testing Category:Netflix (company)