Resilience and Testing: Chaos Engineering
1. Introduction: Embracing Chaos for a World-Class Platform
In our pursuit of a world-class, 99.99% available platform, we must move beyond reactive defense and proactively seek out weakness. Chaos Engineering is the discipline of experimenting on a system in order to build confidence in its capability to withstand turbulent conditions in production. It is a cornerstone of our resilience strategy, allowing us to turn unknown unknowns into known, manageable weaknesses.
This document outlines the principles, methodologies, and tools we will use to implement Chaos Engineering within the XOPS platform. Our approach is inspired by industry leaders like Netflix and adapted for our unique, AI-driven architecture.
The Goal: Not to break things, but to reveal hidden flaws, faulty assumptions, and blind spots in our system before they manifest as production outages.
2. Core Principles of Chaos Engineering at XOPS
We adhere to the following principles for all chaos experiments:
- Hypothesize about Steady State: We must first define what "normal" looks like. Before injecting any failure, we articulate a clear, measurable hypothesis about the expected outcome, assuming the system is resilient.
- Example Hypothesis: "If the primary Knowledge Graph database replica fails, the system will execute a failover to the secondary replica within 30 seconds, and API latency will not increase by more than 10%."
- Vary Real-World Events: Experiments should simulate realistic failures. This includes:
- Infrastructure Failures: EC2 instance termination, EBS volume failures, network latency/packet loss between availability zones.
- Application-Level Failures: API error responses (e.g., 503s), dependency service outages (e.g., Auth0), CPU/memory pressure.
- Run Experiments in Production: While initial experiments will be run in a dedicated staging environment, the ultimate goal is to run them directly in production. This is the only way to be sure we are testing the real, user-facing system.
- Automate Experiments to Run Continuously: Manual, infrequent experiments are not enough. We will build a library of automated chaos experiments that run continuously as part of our CI/CD pipeline and operational schedule, ensuring we are constantly validating our resilience.
- Minimize the Blast Radius: We must always prioritize the customer experience. Experiments will start with the smallest possible impact (e.g., affecting a single internal user or a canary instance) and expand only as our confidence grows. All experiments must have a clear "stop" button (see the sketch after this list).
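To make the first and last principles concrete, here is a minimal sketch, in Python, of a steady-state hypothesis expressed as automated checks plus an abort loop that acts as the "stop" button. The metric names, thresholds, and the `fetch_metrics()` stub are illustrative placeholders, not a description of our actual tooling.

```python
import time

# Illustrative steady-state hypothesis: metric names and thresholds are placeholders.
STEADY_STATE = {
    "api_success_rate": lambda v: v >= 0.999,   # availability
    "p99_latency_ms":   lambda v: v <= 1000.0,  # latency ceiling
}

def fetch_metrics() -> dict:
    """Stub: in practice this would query our observability backend (see Step 3)."""
    raise NotImplementedError

def steady_state_holds(metrics: dict) -> bool:
    """The hypothesis holds only if every check passes."""
    return all(check(metrics[name]) for name, check in STEADY_STATE.items())

def guard_experiment(abort, poll_seconds: int = 10, max_minutes: int = 15) -> None:
    """Poll metrics while an experiment runs; press the 'stop button' on the first violation."""
    deadline = time.time() + max_minutes * 60
    while time.time() < deadline:
        if not steady_state_holds(fetch_metrics()):
            abort()  # halt fault injection immediately
            return
        time.sleep(poll_seconds)
```

In practice, `abort` would call the halt mechanism of whichever fault-injection tool is in use (see Step 2 below).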
3. The Chaos Experiment Workflow
Every chaos experiment follows a structured, five-step process, managed via Jira and integrated with our toolchain.
Step 1: Planning & Design (Game Day)
This is the most critical phase. We plan experiments during "Game Day" sessions.
- Objective: Identify a potential weakness and design an experiment to test it.
- Process:
- Gather the Team: Include SREs, developers from the relevant service, and product owners.
- Define the Steady State: What metrics indicate the system is healthy? (e.g., API success rate, latency p99, queue depths).
- Formulate Hypothesis: State clearly what you expect to happen.
- Design the Experiment:
- Type of Fault: What will you inject? (e.g., latency, error, resource exhaustion).
- Target: What component will you affect?
- Blast Radius: Who is affected? (Start small!).
- Abort Conditions: What metrics will trigger an immediate stop to the experiment?
- Create a Jira Ticket: Use the "Chaos Experiment" issue type to document everything.
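As a complement to the Jira ticket, the experiment design can also be captured as data so that the automation in later steps can consume it. The sketch below is a hypothetical Python structure mirroring the Game Day template in section 5; all field names and the example values are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ChaosExperiment:
    """Hypothetical record mirroring the Game Day template; all field names are illustrative."""
    title: str
    hypothesis: str              # "If we <fault> on <target>, then <outcome> because <reasoning>"
    fault_type: str              # e.g. "latency", "error", "resource", "network"
    target: str                  # component under test
    blast_radius: str            # e.g. "internal users only", "10% of canary traffic"
    abort_conditions: list[str] = field(default_factory=list)
    rollback_plan: str = ""
    jira_ticket: str = ""        # filled in once the "Chaos Experiment" issue is created

# Example: the Knowledge Graph failover hypothesis from section 2
kg_failover = ChaosExperiment(
    title="Knowledge Graph primary replica failure",
    hypothesis="If we terminate the primary replica, failover completes within 30 seconds "
               "and API latency increases by no more than 10%.",
    fault_type="infrastructure",
    target="knowledge-graph primary replica",
    blast_radius="staging environment only",
    abort_conditions=["platform availability < 99.5%", "p99 latency > 1000ms"],
    rollback_plan="Halt fault injection and let the original replica rejoin.",
)
```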
Step 2: Execution
- Tools: We will use a combination of:
- Gremlin / Chaos Mesh: For infrastructure-level chaos (e.g., killing pods, network latency).
- Custom Scripts: For application-level chaos (e.g., forcing a specific API endpoint to return errors).
- Automation: Experiments are triggered via GitHub Actions, which call the appropriate tool's API.
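For application-level chaos, a custom script can be as small as a request hook that fails a configurable fraction of traffic to one endpoint. The sketch below uses Flask purely for illustration; the endpoint path, environment variable names, and defaults are assumptions, not part of an existing service.

```python
import os
import random

from flask import Flask, abort, request

app = Flask(__name__)

# Fault injection is driven by environment variables so it can be toggled by the pipeline
# (or an operator) without a redeploy. Names, path, and defaults are illustrative.
CHAOS_ENABLED = os.getenv("CHAOS_ENABLED", "false").lower() == "true"
CHAOS_TARGET_PATH = os.getenv("CHAOS_TARGET_PATH", "/api/v1/search")
CHAOS_ERROR_RATE = float(os.getenv("CHAOS_ERROR_RATE", "0.5"))  # fraction of requests to fail

@app.before_request
def inject_faults():
    """Fail a configurable fraction of requests to one endpoint with a 503."""
    if (CHAOS_ENABLED
            and request.path == CHAOS_TARGET_PATH
            and random.random() < CHAOS_ERROR_RATE):
        abort(503, description="Injected failure (chaos experiment)")

@app.route("/api/v1/search")
def search():
    return {"results": []}
```

Because the fault is gated behind environment variables, GitHub Actions can enable it, widen the blast radius by raising `CHAOS_ERROR_RATE`, and roll it back with a single configuration change.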
Step 3: Observation & Monitoring
- Key Activity: Watch the dashboards!
- Primary Tool: New Relic. We will have a dedicated "Chaos Day" dashboard that visualizes the steady-state metrics and the impact of the experiment in real-time.
- Alerting: PagerDuty is configured to alert the on-call team if the experiment exceeds the defined blast radius or triggers an unexpected failure.
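Dashboards are for humans; the same steady-state metrics can also be pulled programmatically so abort checks do not depend on someone watching a screen. The sketch below assumes New Relic's NerdGraph GraphQL API; the account ID, API key handling, NRQL, and attribute names are placeholders to be adapted to our actual instrumentation.

```python
import os

import requests

NERDGRAPH_URL = "https://api.newrelic.com/graphql"
ACCOUNT_ID = int(os.environ["NEW_RELIC_ACCOUNT_ID"])  # placeholder
API_KEY = os.environ["NEW_RELIC_API_KEY"]              # placeholder

# Illustrative NRQL: adjust the event type and attributes to our actual instrumentation.
NRQL = (
    "SELECT percentage(count(*), WHERE error IS false) AS success_rate, "
    "percentile(duration, 99) AS p99 "
    "FROM Transaction SINCE 5 minutes ago"
)

def query_steady_state() -> dict:
    """Run one NRQL query through NerdGraph and return the first result row."""
    payload = {
        "query": """
            query($accountId: Int!, $nrql: Nrql!) {
              actor { account(id: $accountId) { nrql(query: $nrql) { results } } }
            }
        """,
        "variables": {"accountId": ACCOUNT_ID, "nrql": NRQL},
    }
    resp = requests.post(NERDGRAPH_URL, json=payload,
                         headers={"API-Key": API_KEY}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["actor"]["account"]["nrql"]["results"][0]
```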
Step 4: Analysis & Learning
- Objective: Did the system behave as hypothesized?
- Process:
- Gather Metrics: Collect all relevant logs, traces, and metrics from the experiment.
- Compare to Hypothesis: If the system was resilient, celebrate! If not, you've found a weakness.
- Root Cause Analysis: Dig deep to understand why the failure occurred. Was it a missing timeout? A race condition? A faulty retry mechanism?
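The "compare to hypothesis" step can itself be scripted from metric snapshots captured before and during the experiment. A minimal sketch, assuming snapshot dictionaries with `success_rate` and `p99` keys; keys and thresholds are placeholders.

```python
def compare_to_hypothesis(baseline: dict, during: dict,
                          max_latency_increase: float = 0.10,
                          min_success_rate: float = 0.999) -> list[str]:
    """Return a list of hypothesis violations; an empty list means the system was resilient."""
    findings = []
    if during["success_rate"] < min_success_rate:
        findings.append(
            f"Availability dropped to {during['success_rate']:.4%} "
            f"(hypothesis: >= {min_success_rate:.2%})"
        )
    latency_increase = during["p99"] / baseline["p99"] - 1
    if latency_increase > max_latency_increase:
        findings.append(
            f"p99 latency rose {latency_increase:.0%} "
            f"(hypothesis: <= {max_latency_increase:.0%})"
        )
    return findings
```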
Step 5: Improvement
- Action Items: Create Jira tickets for the engineering teams to fix the uncovered issues. These tickets are high-priority.
- Update the Handbook: If the experiment reveals a flaw in our operational processes, update the relevant section of this handbook.
- Automate the Experiment: Once the vulnerability is fixed, add the chaos experiment to our automated suite to prevent regressions.
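Once the fix lands, the experiment joins the automated suite as a regression test. The sketch below is pytest-style; the `chaos_suite` helpers are hypothetical wrappers around the fault-injection tooling and metric queries described above, not an existing internal library.

```python
import pytest

# Hypothetical helpers wrapping the fault-injection tooling and metric queries sketched
# above; "chaos_suite" is not an existing internal library.
from chaos_suite import inject_fault, remove_fault, query_steady_state

@pytest.mark.chaos  # only selected by the scheduled chaos pipeline, never on every PR
def test_knowledge_graph_replica_failover():
    """Regression test for the replica-failover weakness once the fix has shipped."""
    baseline = query_steady_state()
    inject_fault(target="knowledge-graph-primary", action="pod-kill")
    try:
        during = query_steady_state()
        assert during["success_rate"] >= 0.999
        assert during["p99"] <= baseline["p99"] * 1.10
    finally:
        remove_fault(target="knowledge-graph-primary")
```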
4. Sparky's Role in Chaos Engineering
Our AI agent, Sparky, plays a crucial role in automating and enhancing our chaos engineering practice.
- Automated Experiment Design: Cerebro can analyze past incidents and suggest new chaos experiments to test for similar failure modes. Sparky can then automatically create the initial Jira ticket and experiment configuration.
- Intelligent Abort Conditions: Instead of relying on simple static thresholds, Sparky can monitor a complex set of metrics and use Cerebro's predictive models to determine whether an experiment is heading towards a dangerous, unpredicted failure, then halt it automatically.
- Automated Triage of Experiment Fallout: If a chaos experiment causes a real incident, Sparky is the first responder. It can immediately identify that the failure is part of a planned experiment, link it to the corresponding Jira ticket, and prevent a full-blown, all-hands-on-deck incident response for an expected failure.
- Validating Automated Fixes: When Sparky proposes an automated fix for a production issue, we can use a chaos experiment to validate that the fix actually works. Sparky can create a PR for the fix and a PR for a new chaos experiment that simulates the original failure, ensuring the fix is robust before it's even merged.
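As an illustration of the automated-triage idea, the sketch below checks whether an active "Chaos Experiment" ticket matches the affected component before a full incident response is paged. The Jira project key, issue type, summary matching, and the `downgrade_incident` helper are assumptions for illustration only.

```python
from jira import JIRA  # third-party "jira" client; connection details below are placeholders

def triage_incident(jira: JIRA, affected_component: str) -> str | None:
    """Return the key of an active chaos experiment targeting the affected component, if any.

    This mirrors Sparky's first-responder check: an expected, experiment-induced failure
    gets linked to its experiment ticket instead of paging the full on-call rotation.
    The project key, issue type, and summary matching are illustrative assumptions.
    """
    jql = 'project = CHAOS AND issuetype = "Chaos Experiment" AND status = "In Progress"'
    for issue in jira.search_issues(jql):
        if affected_component.lower() in issue.fields.summary.lower():
            return issue.key
    return None

# Usage sketch:
# client = JIRA(server="https://example.atlassian.net",
#               basic_auth=("sparky-bot@example.com", "<api-token>"))
# ticket = triage_incident(client, affected_component="knowledge-graph")
# if ticket:
#     downgrade_incident(ticket)  # hypothetical: annotate the alert and suppress escalation
```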
5. Game Day Plan Template
Use this template in Jira for every new chaos experiment.
### Chaos Experiment: [Brief, Descriptive Title]
**1. Steady State Hypothesis:**
> We believe that the system will maintain the following characteristics:
> - **Availability:** [e.g., API success rate >= 99.9%]
> - **Latency:** [e.g., p95 latency for service X < 300ms]
> - **Other:** [e.g., Queue depth for service Y < 100]
**2. Experiment Hypothesis:**
> If we [**FAULT TYPE**] on [**TARGET COMPONENT**], then [**EXPECTED OUTCOME**] because [**REASONING**].
**3. Experiment Design:**
> - **Type:** [Latency, Error, Resource, Network]
> - **Tool:** [Gremlin, Custom Script, etc.]
> - **Magnitude:** [e.g., 500ms latency, 50% of requests return 503]
> - **Blast Radius:** [e.g., Internal users only, one specific tenant, 10% of canary traffic]
**4. Abort Conditions (STOP criteria):**
> The experiment will be immediately halted if:
> - [e.g., Platform-wide availability drops below 99.5%]
> - [e.g., p99 latency exceeds 1000ms]
**5. Rollback Plan:**
> [e.g., Execute Gremlin halt command, disable feature flag]
**6. Post-Experiment Analysis:**
> - **Did the system behave as expected?** [Yes/No]
> - **Root Cause:** [If no, detailed analysis]
> - **Action Items:** [Link to Jira tickets]