Table of Contents
Part 1: Core Concepts & Philosophy
- Shared Responsibility Model: Defining the boundaries of accountability for data quality, integrations, and external dependencies.
- SRE and Monitoring: Our core principles for Site Reliability Engineering, including our SLO/SLI definitions and error budgets.
- Architectural Monitoring: How we observe our system with the three pillars: Metrics, Logs, and Traces.
- Service Chain Monitoring: How we monitor end-to-end user journeys, including third-party dependencies.
- Workflows and Sparky, our AI Agent: An overview of Sparky, the AI agent that automates our operations.
Part 2: Processes & Workflows
- Change & Incident Management: How we manage change safely and respond to incidents effectively.
- Performance Tuning: Our methodology for diagnosing and fixing performance bottlenecks.
- Resilience & Chaos Engineering: How we proactively test our system's resilience by injecting failure.
- Platform Operations & Maintenance: Our rhythm of periodic checks, security audits, and maintenance.
- Proactive Problem Resolution: How we use AI to detect and fix problems before they impact users.
- Customer Success & Interaction: How we route customer requests and manage the customer experience.
- Ecosystem Integrations Management: How we manage the lifecycle of third-party integrations.
Part 3: Tooling
- Toolset Guides: A central index for all our operational tools.
Part 4: Advanced Topics & Strategies
- FinOps: Financial Operations: Managing our cloud spend for efficiency and value.
- Data Governance Policy: Ensuring data integrity, privacy, and compliance.
- MLOps Lifecycle: Managing the lifecycle of our machine learning models.
- AI Labelling Procedures: Standards for labelling the data used to train our AI models.
- Quality Engineering Strategy: Our comprehensive approach to software quality, including testing and automation.
All workflows incorporate Sparky, an AI agent triggered by webhooks from Sentry or Jira Service Management. Sparky handles initial triage (L1/L2) and proposes fixes (L3) as pull requests. See the Sparky, our AI Agent guide for more details.
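
As a rough illustration of the flow described above, the sketch below routes an incoming webhook event either to triage (L1/L2) or to a proposed fix via pull request (L3). It is a minimal sketch, not Sparky's actual implementation: the source identifiers, the `TriageResult` type, the `route_event` function, and the `needs_code_change` flag are all hypothetical names introduced for illustration, and real Sentry or Jira Service Management payloads are not modelled here.

```python
from dataclasses import dataclass

# Hypothetical identifiers: the handbook names Sentry and Jira Service
# Management as webhook sources but does not specify payloads or routing rules.
KNOWN_SOURCES = {"sentry", "jira_service_management"}


@dataclass
class TriageResult:
    source: str  # which system sent the webhook
    tier: str    # "L1/L2" for triage only, "L3" when a fix is proposed
    action: str  # "triage" or "propose_fix_pr"


def route_event(source: str, needs_code_change: bool) -> TriageResult:
    """Decide whether an event stops at triage or leads to a proposed fix.

    `needs_code_change` is an assumed outcome of the triage step; how it
    would actually be determined is outside the scope of this sketch.
    """
    if source not in KNOWN_SOURCES:
        raise ValueError(f"unsupported webhook source: {source}")
    if needs_code_change:
        return TriageResult(source, "L3", "propose_fix_pr")
    return TriageResult(source, "L1/L2", "triage")
```

For example, `route_event("sentry", needs_code_change=True)` returns a result whose `action` is `"propose_fix_pr"`, while an event from an unrecognised source raises `ValueError` rather than being handled silently.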