SRE and Monitoring/Availability
1. Introduction
This section outlines the target (to-be) state of our SRE practices for monitoring and platform availability. We aim for 99.99% uptime, using AI-driven tools to monitor health across all architectural components. Dashboards will provide views ranging from the whole platform down to individual customer tenants.
2. Defining Our Promises: SLOs and SLIs
A Service Level Objective (SLO) is a target value or range of values for a service level that is measured by a Service Level Indicator (SLI). SLIs are the direct measurements of our platform's performance, while SLOs are the promises we make to our customers (and ourselves). Our primary SLO is 99.99% availability.
Our Error Budget is the time we are allowed to be unavailable: 100% - 99.99% = 0.01%, which translates to approximately 4 minutes and 20 seconds of downtime per 30-day month.
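As a quick illustration of the arithmetic (a sketch only, assuming a 30-day month), the error budget can be derived directly from the SLO target:

```python
# Illustrative only: derive the monthly downtime budget from the 99.99% availability SLO.
SLO_TARGET = 0.9999                 # 99.99% availability
MINUTES_PER_MONTH = 30 * 24 * 60    # assuming a 30-day month

error_budget_fraction = 1 - SLO_TARGET                        # 0.0001, i.e. 0.01%
allowed_downtime_minutes = error_budget_fraction * MINUTES_PER_MONTH

print(f"Error budget: {error_budget_fraction:.2%} of the month")
print(f"Allowed downtime: {allowed_downtime_minutes:.2f} minutes (~4 min 19 s)")
```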
Core SLIs
| SLI Name | Description | Measurement (NRQL Query) | Threshold |
|---|---|---|---|
| API Availability | Percentage of successful API requests (HTTP status < 500) to the main Platform API gateway. | `SELECT (count(*) - filter(count(*), WHERE http.statusCode >= 500)) / count(*) * 100 FROM Transaction WHERE appName = 'platform-api-gateway'` | > 99.99% |
| API Latency (p95) | 95th percentile of response time for requests to the Platform API gateway. | `SELECT percentile(duration, 95) FROM Transaction WHERE appName = 'platform-api-gateway'` | < 250ms |
| Knowledge Graph Ingestion Freshness | The time delay between an event occurring and it being available in the 'gold' layer of the Knowledge Graph. | `SELECT max(timestamp) - min(timestamp) FROM KgGoldLayerEvent` | < 5 minutes |
| Control Center Page Load (p99) | 99th percentile of the full page load time for the main Control Center dashboard. | `SELECT percentile(pageLoadTime, 99) FROM BrowserInteraction WHERE appName = 'control-center-frontend'` | < 2 seconds |
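To make the first two SLIs concrete, the sketch below mirrors the semantics of the NRQL queries (success = HTTP status below 500; latency at the 95th percentile) on raw request records. The `Request` type and the sample data are hypothetical and exist only for illustration.

```python
# Hypothetical sketch: computing the API Availability and API Latency (p95) SLIs
# from raw request records, mirroring the NRQL queries above.
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    status_code: int
    duration_ms: float

def availability(requests: list[Request]) -> float:
    """Percentage of requests that did not fail with a 5xx status."""
    successes = sum(1 for r in requests if r.status_code < 500)
    return successes / len(requests) * 100

def latency_p95(requests: list[Request]) -> float:
    """95th percentile of request duration in milliseconds."""
    # quantiles(..., n=100) returns the 1st..99th percentile cut points.
    return quantiles([r.duration_ms for r in requests], n=100)[94]

sample = [Request(200, 120.0)] * 99_999 + [Request(503, 950.0)]
print(f"Availability: {availability(sample):.3f}%  (SLO target: > 99.99%)")
print(f"p95 latency:  {latency_p95(sample):.1f} ms  (SLO target: < 250 ms)")
```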
3. Detailed Workflow: From Alert to Resolution
This workflow details the automated and manual steps taken when an SLI begins to burn our error budget. This process is designed for speed and precision.
- Automated Alerting:
  - Source: New Relic Alerts are configured for each of our core SLIs.
  - Condition: An alert fires when the SLI performance drops to a level that threatens the SLO (e.g., availability drops below 99.99% over a 5-minute window).
  - Action: New Relic triggers a Critical incident in PagerDuty, which immediately notifies the on-call SRE.
- Sparky's Initial Triage (The first 60 seconds):
  - Simultaneously, the New Relic alert also fires a webhook to Sparky.
  - Sparky queries the Knowledge Graph for immediate context: "What deployments happened in the last 30 minutes?", "Are there any active chaos experiments?", "Is this affecting a single tenant or the whole platform?".
  - If a high-confidence L1 playbook matches (e.g., "This alert correlates with a recent deployment"), Sparky immediately posts a message in the incident's Slack channel: "Alert correlated with deployment xyz. Suggesting automated rollback." A minimal sketch of such a triage handler appears after this list.
- Human Response & Incident Management:
  - The on-call SRE acknowledges the PagerDuty alert and assumes the role of Incident Commander (IC).
  - The IC's first action is to check the Slack channel to see Sparky's initial diagnosis.
  - Based on Sparky's input and their own initial look at the dashboards, the IC decides whether to approve Sparky's suggested L1 action or to assemble a team of Subject Matter Experts (SMEs).
  - The full incident response process, including roles and communications, is detailed in the Change and Incident Management guide.
- Resolution and Post-Mortem:
  - Once the incident is resolved (either by Sparky's L1 action or an L2/L3 fix by the team), the fix is verified against the dashboards.
  - A blameless post-mortem is automatically generated by Cerebro and reviewed by the team to ensure we learn from the event and create preventative actions.
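The triage step above can be pictured as a small webhook handler. The following is a minimal sketch under heavy assumptions: the payload fields, the `knowledge_graph` and `slack` interfaces, and the playbook-matching logic are hypothetical stand-ins, not Sparky's actual implementation.

```python
# Hypothetical sketch of Sparky's L1 triage on an incoming New Relic alert webhook.
# All helper interfaces and payload fields here are illustrative assumptions.
from datetime import datetime, timedelta, timezone

def triage_alert(payload: dict, knowledge_graph, slack) -> None:
    """Gather context for an alert and, if a playbook matches, suggest an L1 action."""
    condition = payload.get("condition_name", "unknown condition")
    channel = payload.get("incident_channel", "#incident-unknown")

    window_start = datetime.now(timezone.utc) - timedelta(minutes=30)

    # Context queries against the Knowledge Graph (illustrative interface).
    recent_deploys = knowledge_graph.deployments_since(window_start)
    chaos_active = knowledge_graph.active_chaos_experiments()
    affected_tenants = knowledge_graph.tenants_affected_by(condition)

    if chaos_active:
        slack.post(channel, f"{condition}: an active chaos experiment may explain this alert; "
                            "confirm before taking action.")
    elif recent_deploys:
        latest = recent_deploys[-1]
        slack.post(channel, f"{condition}: correlated with deployment {latest.id}. "
                            "Suggesting automated rollback (awaiting IC approval).")
    else:
        scope = "single tenant" if len(affected_tenants) == 1 else "platform-wide"
        slack.post(channel, f"{condition}: no matching L1 playbook; impact appears {scope}. "
                            "Escalating to SMEs.")
```

The key design point is that Sparky only suggests the L1 action; execution waits for the Incident Commander's approval, as described in the Human Response step.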
4. Dashboards: Visualizing Our Health
Dashboards are our single source of truth for platform health. They are built in New Relic and are accessible to everyone in the engineering organization.
The SRE Primary Dashboard
This is the main dashboard for the on-call SRE. It shows the health of our core SLOs at a glance.
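One way to keep this dashboard reviewable and consistent across environments is to describe it as data in version control and push it to New Relic from CI. The structure below is a deliberately simplified, hypothetical representation (not the literal New Relic dashboard JSON schema); it simply groups the core SLI queries from section 2 into widgets.

```python
# Simplified, hypothetical dashboard-as-code definition for the SRE Primary Dashboard.
# This mirrors the Core SLI queries; it is not the literal New Relic dashboard schema.
SRE_PRIMARY_DASHBOARD = {
    "name": "SRE Primary Dashboard",
    "widgets": [
        {
            "title": "API Availability (SLO > 99.99%)",
            "nrql": "SELECT (count(*) - filter(count(*), WHERE http.statusCode >= 500)) "
                    "/ count(*) * 100 FROM Transaction WHERE appName = 'platform-api-gateway'",
        },
        {
            "title": "API Latency p95 (SLO < 250 ms)",
            "nrql": "SELECT percentile(duration, 95) FROM Transaction "
                    "WHERE appName = 'platform-api-gateway'",
        },
        {
            "title": "Control Center Page Load p99 (SLO < 2 s)",
            "nrql": "SELECT percentile(pageLoadTime, 99) FROM BrowserInteraction "
                    "WHERE appName = 'control-center-frontend'",
        },
    ],
}
```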
Dashboard Types
- Platform-Wide Health: The primary SRE dashboard, showing the overall health of the entire XOPS platform against our top-level SLOs.
- Service-Specific Dashboards: Every microservice has its own detailed dashboard showing its specific SLIs, resource usage (CPU/Mem), error rates, and dependencies. These are owned by the service's engineering team.
- Tenant-Specific Dashboards: The Customer Success team has access to a set of dashboards that show the health and performance of the platform from the perspective of our largest customers. These are built to be easily shared during quarterly business reviews (QBRs).
- Chaos Day Dashboard: A specialized dashboard used during resilience testing to monitor the impact of chaos experiments in real-time.
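For the Chaos Day dashboard to overlay experiments on the SLI charts, the chaos tooling needs to emit a marker when an experiment starts and stops. One option is a custom event recorded through the New Relic Python agent; in the sketch below the `ChaosExperiment` event type and its attributes are our own assumptions, not an existing schema.

```python
# Hedged sketch: emit a custom event when a chaos experiment starts/stops so the
# Chaos Day dashboard can overlay experiments on SLI charts.
# The 'ChaosExperiment' event type and its attributes are our own naming assumptions.
import newrelic.agent

def record_chaos_marker(experiment: str, target_service: str, phase: str) -> None:
    """Record a ChaosExperiment custom event (phase is 'start' or 'stop')."""
    newrelic.agent.record_custom_event(
        "ChaosExperiment",
        {"experiment": experiment, "targetService": target_service, "phase": phase},
    )

# Example usage during a chaos run:
# record_chaos_marker("kill-platform-api-pod", "platform-api-gateway", "start")
```

The dashboard could then query these markers with something like `SELECT * FROM ChaosExperiment SINCE 1 hour ago` and plot them alongside the SLI widgets.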