Sparky: The AI Operations Agent
1. Introduction: Automating World-Class Operations
Sparky is a custom-built AI agent that serves as the backbone of our automated operations. It is the intelligent, proactive, and tireless engineer that handles the first response to any operational event on the XOPS platform. Its primary directive is to reduce toil, accelerate resolution, and learn from every event to make the entire system more resilient.
This document provides a comprehensive overview of Sparky's architecture, its decision-making logic, and its library of operational playbooks.
Core Mission: To autonomously handle 80% of L1/L2 operational tasks, allowing human engineers to focus on high-value L3 work, strategic initiatives, and innovation.
2. Sparky's Architecture: An Integrated AI Ecosystem
Sparky is not a standalone service. It is deeply integrated with our core architectural components, acting as the intelligent glue between observation and action.
Component Breakdown:
- Webhook Ingestor: A serverless function that listens for incoming webhooks from our toolchain. It validates, authenticates, and normalizes these events before passing them to the Triage Engine.
- Triage Engine: The brain of Sparky. This engine orchestrates the L1/L2/L3 decision-making process.
- Knowledge Graph Integration: Sparky's first step is always to query the Knowledge Graph. It asks questions like:
- "What is the full context of this event?"
- "Has this happened before?"
- "What services are downstream of the affected component?"
- "Was there a recent deployment to this service?"
- Cerebro Integration: For complex or unknown events, Sparky consults Cerebro for deeper analysis, asking:
- "Based on historical data, what is the likely root cause?"
- "What is the predicted impact of this event?"
- "What is the recommended fix?"
- Autonomous Engine Integration: Once a course of action is decided, Sparky uses the Autonomous Engine to execute commands securely and auditably.
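To make the component boundaries concrete, the sketch below shows one hypothetical way the normalized event and the three integration clients could be shaped. The class and field names (NormalizedEvent, KnowledgeGraphClient, CerebroClient, AutonomousEngineClient) are illustrative assumptions, not Sparky's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class NormalizedEvent:
    """Canonical event shape the Webhook Ingestor hands to the Triage Engine."""
    source: str                     # e.g. "prometheus", "github", "pagerduty"
    service: str                    # affected XOPS service
    severity: str                   # "info" | "warning" | "critical"
    payload: dict[str, Any] = field(default_factory=dict)


class KnowledgeGraphClient:
    """Answers the contextual questions the Triage Engine asks first."""

    def context_for(self, event: NormalizedEvent) -> dict[str, Any]:
        # Placeholder: a real client would query the Knowledge Graph store.
        return {
            "prior_occurrences": [],
            "downstream_services": [],
            "recent_deployments": [],
        }


class CerebroClient:
    """Consulted for deeper analysis of complex or unknown events."""

    def analyze(self, event: NormalizedEvent, context: dict[str, Any]) -> dict[str, Any]:
        # Placeholder: a real client would call Cerebro's analysis API.
        return {"root_cause": None, "predicted_impact": None,
                "recommended_fix": None, "confidence": 0.0}


class AutonomousEngineClient:
    """Executes approved commands securely and auditably."""

    def execute(self, command: str, *, dry_run: bool = True) -> str:
        # Placeholder: a real client would submit the command for audited execution.
        return f"{'DRY RUN' if dry_run else 'EXECUTED'}: {command}"
```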
3. The Triage Logic: L1, L2, and L3 Resolution
Sparky's triage logic is designed to be fast, safe, and effective. It follows a hierarchical decision-making process, ensuring that it only takes actions for which it has high confidence.
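The skeleton of that decision process is sketched below; the three levels are detailed in the subsections that follow. The confidence threshold, the Playbook fields, and the return values are assumptions chosen for illustration, not Sparky's exact logic.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

CONFIDENCE_L2 = 0.85  # assumed threshold for proposing a code-level fix


@dataclass
class Playbook:
    name: str
    success_rate: float          # historical success rate of automated runs
    blast_radius: str            # "minimal" | "moderate" | "wide"
    execute: Callable[[dict], str]


def triage(event: dict,
           matched_playbook: Optional[Playbook],
           analysis: dict[str, Any]) -> tuple[str, str]:
    """Route an event to L1 (auto-fix), L2 (PR for review), or L3 (human escalation)."""
    # L1: only pre-defined playbooks with a >99% success rate and minimal blast radius.
    if matched_playbook and matched_playbook.success_rate > 0.99 \
            and matched_playbook.blast_radius == "minimal":
        return "L1", matched_playbook.execute(event)

    # L2: Cerebro proposed a specific fix with high confidence -> open a PR for review.
    if analysis.get("recommended_fix") and analysis.get("confidence", 0.0) >= CONFIDENCE_L2:
        return "L2", f"open PR with fix: {analysis['recommended_fix']}"

    # L3: novel, complex, or high-risk -> prepare the escalation package for a human.
    return "L3", "create Jira ticket and page on-call engineer"


if __name__ == "__main__":
    low_confidence_analysis = {"recommended_fix": None, "confidence": 0.2}
    print(triage({"service": "payments"}, None, low_confidence_analysis))  # -> L3
```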
L1: Automated Resolution (High-Confidence, Low-Risk)
- Trigger: An event matches a pre-defined, high-confidence playbook. These are for simple, common issues with deterministic fixes.
- Action: Sparky executes the playbook without human intervention.
- Examples:
- Restarting a pod that has entered a crash loop for a known, benign reason.
- Scaling up a service in response to a predictable traffic spike.
- Clearing a temporary cache.
- Guardrails: L1 actions are taken only when the playbook has a historical success rate above 99% and the blast radius is confirmed to be minimal.
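As an illustration of the first example above, an L1 action such as the stateless pod restart might be implemented roughly as follows, assuming Sparky's execution environment has kubectl access; the namespace and pod names are illustrative, and the real playbook would run through the Autonomous Engine rather than a direct subprocess call.

```python
import subprocess


def restart_stateless_pod(namespace: str, pod: str, dry_run: bool = True) -> str:
    """Delete a crash-looping pod so its Deployment reschedules a fresh replica."""
    cmd = ["kubectl", "delete", "pod", pod, "-n", namespace]
    if dry_run:
        # Log the intended action without touching the cluster.
        return "DRY RUN: " + " ".join(cmd)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical pod name; a real playbook would verify the new pod reaches Ready.
    print(restart_stateless_pod("payments", "payments-api-7c9f-xkq2"))
```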
L2: AI-Generated Resolution (Medium-Confidence, Review-Required)
- Trigger: The issue is not a simple L1 case, but Cerebro can analyze the situation and propose a specific, code-level fix with high confidence.
- Action: Sparky generates the code for the fix and opens a Pull Request on GitHub. The PR is assigned to the on-call L3 engineer for that service.
- The PR is a complete package:
- Title: Sparky L2 Fix: [Brief Description of Fix]
- Body: Contains a detailed explanation of the root cause, the proposed fix, and links to relevant logs and metrics from the Knowledge Graph.
- Automated Tests: The PR includes newly generated unit and integration tests to validate the fix.
- Goal: To reduce the engineer's work from "diagnose and fix" to just "review and approve".
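For illustration, opening such a PR could be done through GitHub's REST API roughly as shown below. The repository, branch names, and the GITHUB_TOKEN environment variable are assumptions; Sparky's actual implementation may use a different client or app-based authentication.

```python
import os
import requests


def open_l2_pull_request(repo: str, head_branch: str, title: str, body: str,
                         base_branch: str = "main") -> str:
    """Open a pull request containing Sparky's proposed fix and return its URL."""
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/pulls",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"title": title, "body": body, "head": head_branch, "base": base_branch},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]


if __name__ == "__main__":
    url = open_l2_pull_request(
        repo="example-org/payments-service",                 # illustrative repo
        head_branch="sparky/l2-fix-connection-pool",          # branch with the generated fix
        title="Sparky L2 Fix: Increase DB connection pool size",
        body="Root cause, proposed fix, and Knowledge Graph links go here.",
    )
    print(url)
```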
L3: Human Escalation (Low-Confidence or High-Risk)
- Trigger: The issue is novel, complex, or potentially high-risk, and neither the playbooks nor Cerebro can determine a safe, automated course of action.
- Action: Sparky's job is now to prepare the most comprehensive and useful bug report possible for the human engineer.
- The Escalation Package:
- Gather all context: Sparky queries the Knowledge Graph for all logs, traces, metrics, recent deployments, and related incidents.
- Create a Jira Ticket: It creates a P1/P2 ticket in Jira, pre-filled with all the gathered context.
- Escalate via PagerDuty: It triggers a PagerDuty incident, linking directly to the Jira ticket.
- Goal: To ensure that by the time an engineer is paged, they have all the information they need to start working on the problem immediately, minimizing MTTD and MTTR.
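A hedged sketch of wiring up that escalation package is shown below, using Jira's REST API and PagerDuty's Events API v2. The Jira site URL, project key, priority name, and environment variable names are illustrative assumptions, not Sparky's actual configuration.

```python
import os
import requests

JIRA_BASE = "https://example.atlassian.net"           # illustrative Jira site
PD_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def create_escalation_ticket(summary: str, context_report: str) -> str:
    """Create a high-priority Jira ticket pre-filled with Knowledge Graph context."""
    resp = requests.post(
        f"{JIRA_BASE}/rest/api/2/issue",
        auth=(os.environ["JIRA_EMAIL"], os.environ["JIRA_API_TOKEN"]),
        json={
            "fields": {
                "project": {"key": "XOPS"},            # illustrative project key
                "summary": summary,
                "description": context_report,          # logs, traces, metrics, deploys
                "issuetype": {"name": "Bug"},
                "priority": {"name": "P1"},             # assumes a "P1" priority exists
            }
        },
        timeout=10,
    )
    resp.raise_for_status()
    return f"{JIRA_BASE}/browse/{resp.json()['key']}"


def page_on_call(summary: str, service: str, ticket_url: str) -> None:
    """Trigger a PagerDuty incident that links straight to the Jira ticket."""
    resp = requests.post(
        PD_EVENTS_URL,
        json={
            "routing_key": os.environ["PAGERDUTY_ROUTING_KEY"],
            "event_action": "trigger",
            "payload": {"summary": summary, "source": service, "severity": "critical"},
            "links": [{"href": ticket_url, "text": "Jira escalation ticket"}],
        },
        timeout=10,
    )
    resp.raise_for_status()
```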
4. Sparky's Playbooks
This section serves as an index for Sparky's L1 automated resolution playbooks. Each playbook is a separate, detailed document outlining the trigger, the exact steps Sparky takes, and the verification process.
- Playbook-001: Stateless Pod Restart - Status: Active
- Playbook-002: Service Auto-Scaling - Status: Active
- Playbook-003: Proactive API Token Expiry Notification - Status: Active
- Playbook-004: Cache Flush for Data Inconsistency - Status: In Development
- [More playbooks to be added as they are developed]
(Note: The linked playbook files will be created in subsequent steps as this project is built out).
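Until those documents exist, the sketch below shows one hypothetical way a playbook entry could be represented internally, covering the trigger, steps, and verification checks each playbook document will describe. The field names and example values are assumptions, not the final schema.

```python
from dataclasses import dataclass, field


@dataclass
class PlaybookDefinition:
    playbook_id: str
    name: str
    status: str                                             # "Active" | "In Development"
    trigger: str                                            # condition that activates the playbook
    steps: list[str] = field(default_factory=list)          # exact actions Sparky takes
    verification: list[str] = field(default_factory=list)   # checks that confirm success


POD_RESTART = PlaybookDefinition(
    playbook_id="Playbook-001",
    name="Stateless Pod Restart",
    status="Active",
    trigger="Pod in CrashLoopBackOff with a known, benign failure signature",
    steps=["Delete the affected pod", "Wait for the Deployment to reschedule a replica"],
    verification=["New pod reaches Ready state", "Error rate returns to baseline"],
)
```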