
Change and Incident Management

1. Introduction: Balancing Speed and Stability

In a world-class SRE organization, the goal is not to prevent change, but to enable it to happen quickly and safely. At the same time, we must be prepared to respond to unplanned incidents with speed, precision, and a commitment to learning. Change Management and Incident Management are two sides of the same coin: managing risk and maintaining stability in a complex, evolving system.

This document outlines our unified framework for handling both planned changes and unplanned incidents. It defines the processes, roles, and tools that allow us to move fast without breaking things.

Core Mission: To create a resilient system where changes are safe and incidents are rare, brief, and informative.


2. Change Management: The Path to Production

All changes to the production environment, without exception, must follow this process. The goal is to ensure that every change is reviewed, tested, and approved before it reaches our customers.

Change Categories

We classify changes into two types:

  1. Standard Change: A routine, planned change that follows the full RFC and deployment pipeline process. This accounts for >99% of all changes. Examples: deploying a new feature, updating a dependency, modifying a database schema.
  2. Emergency Change: A high-risk, urgent change required to resolve a P1 incident. This process bypasses some of the standard checks for the sake of speed, and requires a higher level of approval and post-change review.

The Standard Change Workflow

  1. Create RFC: The developer creates a "Request for Change" ticket in Jira Service Management (JSM) using the standard template defined in the JSM guide. This documents the what and the why of the change.
  2. Peer Review: The change is reviewed by at least one other engineer. The focus is on the implementation plan, verification plan, and rollback plan (a minimal sketch of these required fields follows this list).
  3. Approval: The RFC must be approved by the tech lead for the service(s) being changed.
  4. Merge & Deploy: Once approved, the corresponding GitHub PR is merged, triggering the automated CI/CD pipeline as detailed in the GitHub Actions guide.
  5. Verify: After deployment, the engineer follows their verification plan to confirm the change was successful.
  6. Close: The engineer closes the JSM ticket, signaling the completion of the change process.
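
A useful mental model for steps 1 and 2 is that every RFC carries the same small set of required fields, and review is largely a check that each one is present and credible. The sketch below is a minimal illustration of that idea; the `Rfc` class and its field names are hypothetical, not the actual JSM template.

```python
from dataclasses import dataclass, field


@dataclass
class Rfc:
    """Illustrative model of a Request for Change ticket (hypothetical fields)."""
    summary: str                 # the "what" of the change
    motivation: str              # the "why" of the change
    implementation_plan: str     # how the change will be rolled out
    verification_plan: str       # how success will be confirmed after deploy
    rollback_plan: str           # how to revert if verification fails
    services_affected: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the names of any required fields left empty."""
        required = {
            "summary": self.summary,
            "motivation": self.motivation,
            "implementation_plan": self.implementation_plan,
            "verification_plan": self.verification_plan,
            "rollback_plan": self.rollback_plan,
        }
        return [name for name, value in required.items() if not value.strip()]


rfc = Rfc(
    summary="Add index to orders table",
    motivation="Reduce p95 latency of order lookups",
    implementation_plan="Apply migration 0042 via the standard pipeline",
    verification_plan="Confirm p95 latency stays below target for 30 minutes",
    rollback_plan="",  # incomplete: a reviewer should send this RFC back
    services_affected=["orders-api"],
)
print(rfc.missing_fields())  # ['rollback_plan']
```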

The Emergency Change Workflow

This workflow is only for resolving a live P1 incident.

  1. Incident Declared: A P1 incident is active.
  2. Verbal Approval: The Incident Commander gives verbal approval for an emergency change.
  3. Code & Deploy: The engineer codes the fix. The PR can bypass standard E2E tests, but must still pass security scans and unit tests. Deployment can be fast-tracked with a second approval (see the gating sketch after this list).
  4. Create Retroactive RFC: Immediately after the incident is resolved, the engineer creates a JSM RFC ticket with the Emergency label, documenting the change that was made.
  5. Post-Incident Review: The emergency change is a primary topic of the incident's post-mortem. We analyze why the emergency process was necessary and what we can do to avoid it in the future.
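
To make the bypass in step 3 concrete, the decision can be framed as selecting a smaller set of required CI checks and a higher approval bar for emergency changes. The sketch below is illustrative only; the check names, approval counts, and helper functions are assumptions, not our actual pipeline configuration.

```python
# Illustrative sketch: which CI checks a change must pass before deploy.
# Check names and the "emergency" flag are hypothetical, not real pipeline config.

STANDARD_CHECKS = {"unit_tests", "security_scan", "e2e_tests"}
EMERGENCY_CHECKS = {"unit_tests", "security_scan"}  # E2E suite may be bypassed


def required_checks(is_emergency: bool) -> set[str]:
    """Return the CI checks that must pass for this change type."""
    return EMERGENCY_CHECKS if is_emergency else STANDARD_CHECKS


def can_deploy(is_emergency: bool, passed: set[str], approvals: int) -> bool:
    """An emergency change needs a second approval to fast-track deployment."""
    needed_approvals = 2 if is_emergency else 1
    return required_checks(is_emergency) <= passed and approvals >= needed_approvals


# Example: an emergency fix with unit tests and security scan green, two approvals.
print(can_deploy(True, {"unit_tests", "security_scan"}, approvals=2))   # True
print(can_deploy(False, {"unit_tests", "security_scan"}, approvals=1))  # False: E2E missing
```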

3. Incident Management: Responding to the Unexpected

Our incident management process is designed to restore service as quickly as possible while capturing the necessary information for post-mortem analysis.

Incident Severity Levels

  • P1 (Critical): A major, customer-facing outage or data loss. (e.g., Platform API is down, Control Center is inaccessible). All hands on deck.
  • P2 (High): A significant degradation of service or a failure of a core internal system. (e.g., API latency is >2s, Knowledge Graph ingestion is failing). On-call team actively working.
  • P3 (Medium): A minor issue or a failure with a limited blast radius. (e.g., A single, non-critical feature is broken). Handled during business hours.
  • P4/P5 (Low): Cosmetic issues, documentation errors, or other low-impact problems.
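
These levels translate directly into paging and staffing decisions. The mapping below is a minimal sketch of that translation; the specific policy values are illustrative, not an authoritative routing table.

```python
# Illustrative mapping from severity level to response expectations.
# The structure and values are a sketch, not our authoritative paging policy.

RESPONSE_POLICY = {
    "P1": {"page_on_call": True,  "engage_ic": True,  "response": "all hands, immediate"},
    "P2": {"page_on_call": True,  "engage_ic": True,  "response": "on-call team, immediate"},
    "P3": {"page_on_call": False, "engage_ic": False, "response": "business hours"},
    "P4": {"page_on_call": False, "engage_ic": False, "response": "backlog"},
    "P5": {"page_on_call": False, "engage_ic": False, "response": "backlog"},
}


def needs_incident_commander(severity: str) -> bool:
    """P1/P2 incidents get a dedicated Incident Commander."""
    return RESPONSE_POLICY[severity]["engage_ic"]


print(needs_incident_commander("P2"))  # True
print(needs_incident_commander("P3"))  # False
```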

Roles and Responsibilities (for P1/P2 Incidents)

  • Incident Commander (IC): The single point of authority. Manages the overall response, communications, and resources. Does not fix the problem directly.
  • Communications Lead: Manages all internal and external communication. Updates the status page and coordinates with Customer Success.
  • Operations Lead: The technical lead responsible for executing the remediation plan.
  • Subject Matter Experts (SMEs): Engineers from the relevant services who are doing the hands-on work of diagnosing and fixing the issue.

The Incident Response Workflow

The full workflow, including PagerDuty's role, is detailed in the PagerDuty Guide. The key phases are:

  1. Detection: An issue is detected by New Relic or Sentry, or reported manually. A PagerDuty incident is created (see the events sketch after this list).
  2. Triage & Declaration: The on-call engineer assesses the impact and declares the severity level. If P1/P2, the Incident Commander is engaged.
  3. Diagnosis: SMEs work to identify the root cause.
  4. Remediation: The team implements a fix (which may involve an Emergency Change).
  5. Resolution: The service is restored.
  6. Post-Mortem: A blameless post-mortem is conducted for every P1 and P2 incident.
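
For the detection step, monitoring alerts typically become PagerDuty incidents through an events integration. The sketch below assumes the public PagerDuty Events API v2 and a placeholder routing key; it illustrates the shape of a trigger event rather than our actual integration, which is documented in the PagerDuty Guide.

```python
# Illustrative sketch of turning a monitoring alert into a PagerDuty event.
# Assumes the PagerDuty Events API v2; the routing key and payload values are hypothetical.
import requests

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "YOUR_SERVICE_INTEGRATION_KEY"  # placeholder, not a real key


def trigger_incident(summary: str, source: str, severity: str = "critical") -> str:
    """Send a trigger event; PagerDuty creates or deduplicates the incident."""
    response = requests.post(
        EVENTS_API,
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": source,
                "severity": severity,  # one of: critical, error, warning, info
            },
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["dedup_key"]


# Example: a latency alert from monitoring becomes a candidate for triage.
# trigger_incident("Platform API p95 latency > 2s", source="new-relic")
```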

4. Sparky's Role in the Process

Sparky is a key player in both change and incident management, acting as an intelligent assistant to the human teams.

  • In Change Management:

    • RFC Validation: When an RFC is created, Sparky can automatically scan it for completeness. Does it have a rollback plan? Is the verification plan robust? A minimal sketch of such a scan appears at the end of this section.
    • Automated Verification: Sparky can execute parts of the verification plan automatically, such as running synthetic tests or querying metrics, and post the results back to the JSM ticket.
  • In Incident Management:

    • Automated Triage: As detailed in the Sparky AI Agent Guide, Sparky performs the initial L1/L2 triage on all incoming events, resolving many of them before they ever require a human response.
    • Context for Humans: For L3 escalations, Sparky prepares the incident "package" by gathering all relevant data, saving the Incident Commander and SMEs critical time.
    • Incident Scribe: Sparky can act as an automated scribe during an incident, logging key decisions and actions from the Slack channel into the PagerDuty incident timeline.
    • Post-Mortem Helper: After an incident, Sparky and Cerebro can draft the initial post-mortem report by summarizing the timeline and suggesting contributing factors based on Knowledge Graph data.
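
To illustrate the RFC validation idea, a completeness scan can be as simple as checking the ticket description for the sections reviewers expect. The sketch below is hypothetical; the required section headings and the `scan_rfc` helper are assumptions, not Sparky's actual implementation.

```python
# Illustrative sketch of an RFC completeness scan.
# The required section headings are assumptions, not Sparky's actual rules.
import re

REQUIRED_SECTIONS = ["Implementation Plan", "Verification Plan", "Rollback Plan"]


def scan_rfc(description: str) -> list[str]:
    """Return a finding for each required section that is missing or empty."""
    findings = []
    for section in REQUIRED_SECTIONS:
        # Look for the heading followed by at least one non-empty line of content.
        pattern = rf"{re.escape(section)}\s*:?\s*\n\s*\S+"
        if not re.search(pattern, description, flags=re.IGNORECASE):
            findings.append(f"Missing or empty section: {section}")
    return findings


rfc_text = """Summary: Add index to orders table
Implementation Plan:
Apply migration 0042 via the standard pipeline.
Verification Plan:
Watch p95 latency for 30 minutes.
"""
print(scan_rfc(rfc_text))  # ['Missing or empty section: Rollback Plan']
```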