PagerDuty Guide
Introduction
PagerDuty is the central tool for incident alerting, on-call management, escalation, post-mortems, and maintenance scheduling in the XOPS operational framework. It integrates with tools such as Sentry, New Relic, and Jira for seamless workflows. This guide covers detailed usage, workflows, templates, and best practices, and shows how PagerDuty ties into the relevant handbook sections, with emphasis on its role in incident management, post-mortems, and downtime scheduling within the operational toolset.
PagerDuty ensures 24/7 coverage with AI-augmented responses via Sparky, aligning with 2026 SRE standards for automated escalation and resolution.
Usage in Handbook Sections
- SRE and Monitoring/Availability: Primary alerting for availability breaches (e.g., from New Relic dashboards).
- Proactive Problem Resolution: Receives alerts from anomaly detection in Cerebro; triggers Sparky for triage.
- Change and Incident Management: Handles escalations from L2 to L3, integrates with Jira for ticket creation.
- Platform Operations and Maintenance: Schedules maintenance windows to suppress alerts during planned downtime.
- Resilience and Testing: Alerts during chaos engineering tests.
- Service Chain Monitoring: Notifies on chain-specific issues, distinguishing core vs. external problems.
- Customer Success and QBRs: Logs incidents for review in QBRs.
- Monitoring Framework: Integrates with Observability API for platform-wide alerts.
Key Features and Best Practices
- On-Call Schedules: Rotate teams with escalation policies (e.g., L1 to L2 in 5 mins, L2 to L3 in 15 mins); see the sketch after this list.
- Integration Rules: Set up event rules to auto-resolve low-severity alerts via Sparky.
- Maintenance Windows: Schedule to avoid false positives during updates.
- Analytics: Use for MTTR reporting, integrated with Knowledge Graph for AI insights.
- Security: Use StrongDM for access, rotate API keys quarterly.
- Best Practices: Define clear response SLAs (e.g., acknowledge < 2 mins), use the mobile app for on-call, and conduct quarterly drills.
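The escalation timings mentioned in the On-Call Schedules item can be encoded directly in an escalation policy. Below is a minimal Python sketch using the PagerDuty REST API; the schedule IDs and API token are placeholders, and the payload should be verified against the current PagerDuty REST API reference before use.

```python
"""Minimal sketch: create an escalation policy matching the L1 -> L2 (5 min)
and L2 -> L3 (15 min) SLAs. Schedule IDs and the API token are placeholders."""
import requests

PD_API_TOKEN = "REPLACE_WITH_REST_API_KEY"
HEADERS = {
    "Authorization": f"Token token={PD_API_TOKEN}",
    "Content-Type": "application/json",
}

policy = {
    "escalation_policy": {
        "type": "escalation_policy",
        "name": "XOPS Standard Escalation",
        "escalation_rules": [
            # L1 on-call; unacknowledged incidents escalate after 5 minutes
            {"escalation_delay_in_minutes": 5,
             "targets": [{"id": "L1_SCHEDULE_ID", "type": "schedule_reference"}]},
            # L2 on-call; escalates to L3 after a further 15 minutes
            {"escalation_delay_in_minutes": 15,
             "targets": [{"id": "L2_SCHEDULE_ID", "type": "schedule_reference"}]},
            # L3 engineering on-call
            {"escalation_delay_in_minutes": 30,
             "targets": [{"id": "L3_SCHEDULE_ID", "type": "schedule_reference"}]},
        ],
    }
}

resp = requests.post("https://api.pagerduty.com/escalation_policies",
                     json=policy, headers=HEADERS, timeout=10)
resp.raise_for_status()
print("Created policy:", resp.json()["escalation_policy"]["id"])
```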
Detailed Workflow: Incident Management with Post-Mortem
This workflow covers the full lifecycle of an incident, from detection to learning, incorporating Sparky for automation.
1. Incident Detection:
- Trigger: Alert from integrated tools (e.g., Sentry for errors, New Relic for performance thresholds, or Cerebro for predictive anomalies in the service chain).
- Event Ingestion: PagerDuty receives the event via API or webhook. Classify severity (e.g., P1 for outages, P3 for minor issues); see the sketch below.
- Notification: PagerDuty notifies on-call personnel via email, SMS, or app push. If after hours, escalate per policy.
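As referenced in the Event Ingestion sub-bullet, below is a minimal sketch of sending an event to the PagerDuty Events API v2. The mapping from the handbook's P1-P5 levels to the API's severity values is an illustrative assumption, and the routing key is a placeholder for the per-service integration key.

```python
"""Minimal sketch: trigger an incident via the PagerDuty Events API v2."""
import requests

ROUTING_KEY = "REPLACE_WITH_INTEGRATION_KEY"   # per-service integration key

# Assumed mapping from internal priority to Events API severity values
SEVERITY_MAP = {"P1": "critical", "P2": "error", "P3": "warning",
                "P4": "info", "P5": "info"}

def trigger_incident(priority, summary, source, dedup_key=None):
    """Send a trigger event; PagerDuty opens or updates an incident."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,                  # e.g. "new-relic:control-center"
            "severity": SEVERITY_MAP[priority],
        },
    }
    if dedup_key:                              # reuse to group related alerts
        event["dedup_key"] = dedup_key
    resp = requests.post("https://events.pagerduty.com/v2/enqueue",
                         json=event, timeout=10)
    resp.raise_for_status()
    return resp.json()["dedup_key"]            # needed later to ack/resolve

# Example: a P1 availability breach detected by New Relic
# dedup = trigger_incident("P1", "Control Center availability breach", "new-relic")
```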
2. Triage and Acknowledgment:
- On-Call Responds: Acknowledge incident to stop notifications.
- Sparky Integration: For P3-P5 issues, Sparky auto-triages using Knowledge Graph data (e.g., "Is this a known Auth0 latency?"). If resolvable, Sparky executes the fix via the Autonomous Engine and resolves the incident (see the sketch below).
- Manual Triage: If Sparky escalates, L2 engineer investigates using dashboards (e.g., New Relic traces) and logs (via ELK integrated with Sentry).
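The auto-triage behaviour above can be approximated with the Events API: acknowledge to stop paging, then resolve with the original dedup_key once the fix succeeds. This is an illustrative sketch, not Sparky's actual implementation; the routing key and fix callback are placeholders.

```python
"""Minimal sketch: acknowledge and resolve a low-severity incident
programmatically, using the dedup_key returned at trigger time."""
import requests

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "REPLACE_WITH_INTEGRATION_KEY"

def send_event(action, dedup_key):
    """action is 'acknowledge' or 'resolve' for an existing alert."""
    resp = requests.post(EVENTS_URL, json={
        "routing_key": ROUTING_KEY,
        "event_action": action,
        "dedup_key": dedup_key,
    }, timeout=10)
    resp.raise_for_status()

def auto_triage(dedup_key, priority, apply_known_fix):
    """Only P3-P5 incidents are handled automatically, per the workflow above."""
    if priority not in ("P3", "P4", "P5"):
        return False                      # leave for the human on-call
    send_event("acknowledge", dedup_key)  # stop paging while the fix runs
    if apply_known_fix():                 # e.g. a runbook from the Autonomous Engine
        send_event("resolve", dedup_key)
        return True
    return False                          # escalate: manual triage takes over
```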
3. Escalation Management:
- If not resolved in SLA time, auto-escalate to L3/engineering.
- Create Jira ticket: Use the PagerDuty-Jira integration to generate a pre-filled ticket with details (e.g., timeline, impacted chain elements); see the sketch below.
- Collaboration: Use PagerDuty's incident console for real-time updates, integrating with Slack or Teams for team comms.
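The PagerDuty-Jira integration creates the follow-up ticket automatically; the sketch below only illustrates roughly what the pre-filled payload would contain if created by hand via the Jira REST API. The site URL, project key, issue type, and credentials are placeholders.

```python
"""Minimal sketch: create a follow-up Jira ticket for an incident."""
import requests

JIRA_URL = "https://example.atlassian.net/rest/api/2/issue"   # placeholder site
AUTH = ("oncall@example.com", "JIRA_API_TOKEN")               # placeholder creds

def create_followup_ticket(incident_id, title, timeline, impacted_chain):
    fields = {
        "project": {"key": "OPS"},            # placeholder project key
        "issuetype": {"name": "Task"},
        "summary": f"[{incident_id}] {title}",
        "description": (
            f"PagerDuty incident: {incident_id}\n\n"
            f"Timeline:\n{timeline}\n\n"
            f"Impacted chain elements: {', '.join(impacted_chain)}"
        ),
    }
    resp = requests.post(JIRA_URL, json={"fields": fields}, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()["key"]                 # e.g. "OPS-123"
```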
4. Resolution:
- Apply Fix: Engineers deploy changes via GitHub Actions CI/CD, monitored by New Relic.
- Verify: Run health checks across the service chain to confirm resolution (see the sketch below).
- Customer Notification: If customer-impacting, send autonomous update via Platform API (e.g., "Issue resolved—core operations were unaffected").
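A verification pass can be as simple as polling health endpoints across the chain, as in the sketch below; the endpoint URLs are hypothetical placeholders.

```python
"""Minimal sketch: poll service-chain health endpoints after a fix."""
import requests

HEALTH_ENDPOINTS = {
    "control-center": "https://control-center.example.com/healthz",
    "auth-gateway": "https://auth.example.com/healthz",
    "knowledge-graph-api": "https://kg.example.com/healthz",
}

def verify_chain_health():
    failures = []
    for service, url in HEALTH_ENDPOINTS.items():
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code != 200:
                failures.append((service, f"HTTP {resp.status_code}"))
        except requests.RequestException as exc:
            failures.append((service, str(exc)))
    return failures          # empty list means the chain is healthy

# If verify_chain_health() returns failures, keep the incident open.
```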
5. Post-Mortem:
- Initiate: After resolution, PagerDuty auto-creates a post-mortem timeline.
- Analysis: Use template below; involve Cerebro for root cause insights (e.g., SageMaker model review).
- Learning: Update Knowledge Graph with findings; retrain models if needed.
- Review: Share in QBRs or team meetings.
6. Closure and Reporting:
- Resolve Incident: Mark as resolved in PagerDuty.
- Analytics: Generate reports on MTTD/MTTR, feed into operational dashboards.
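For quick MTTR figures outside the built-in analytics, the sketch below approximates resolution time from the REST API incidents list (created_at to last_status_change_at on resolved incidents). The token is a placeholder and pagination is omitted for brevity; use PagerDuty's own analytics reports for authoritative numbers.

```python
"""Minimal sketch: approximate MTTR over the last week from resolved incidents."""
from datetime import datetime, timedelta, timezone
import requests

HEADERS = {"Authorization": "Token token=REPLACE_WITH_REST_API_KEY"}
now = datetime.now(timezone.utc)
params = {
    "since": (now - timedelta(days=7)).isoformat(),
    "until": now.isoformat(),
    "statuses[]": "resolved",
    "limit": 100,
}
resp = requests.get("https://api.pagerduty.com/incidents",
                    headers=HEADERS, params=params, timeout=10)
resp.raise_for_status()

durations_min = []
for inc in resp.json()["incidents"]:
    opened = datetime.fromisoformat(inc["created_at"].replace("Z", "+00:00"))
    closed = datetime.fromisoformat(inc["last_status_change_at"].replace("Z", "+00:00"))
    durations_min.append((closed - opened).total_seconds() / 60)

if durations_min:
    print(f"Approx. MTTR over {len(durations_min)} incidents: "
          f"{sum(durations_min) / len(durations_min):.1f} minutes")
```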
Detailed Workflow: Downtime/Maintenance Scheduling
This ensures planned work doesn't trigger unnecessary alerts.
1. Planning:
- Identify Need: For updates, pen testing, or chaos engineering (from the Resilience section).
- Schedule: Use PagerDuty's maintenance window feature to set the time and affected services (e.g., Control Center only); see the sketch after this workflow.
2. Notification:
- Internal: Alert team via PagerDuty.
- External: Proactive customer notification (e.g., "Scheduled maintenance on auth chain—minimal impact expected").
3. Execution:
- Suppress Alerts: PagerDuty ignores events during the window.
- Monitor: Continue tracking via New Relic for any anomalies.
4. Completion:
- End Window: Verify system health post-maintenance.
- Report: Log any issues in Jira for review.
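The scheduling step in Planning can also be automated. Below is a minimal sketch that creates a maintenance window via the PagerDuty REST API; the service ID, token, and From address are placeholders, and the payload should be checked against the current API reference.

```python
"""Minimal sketch: create a maintenance window so planned work does not page anyone."""
from datetime import datetime, timedelta, timezone
import requests

HEADERS = {
    "Authorization": "Token token=REPLACE_WITH_REST_API_KEY",
    "From": "oncall-manager@example.com",     # requester address, placeholder
    "Content-Type": "application/json",
}

start = datetime.now(timezone.utc) + timedelta(days=1)
window = {
    "maintenance_window": {
        "type": "maintenance_window",
        "start_time": start.isoformat(),
        "end_time": (start + timedelta(hours=2)).isoformat(),
        "description": "Quarterly pen test on Control Center",
        # Only the affected service is suppressed; the rest of the chain still alerts
        "services": [{"id": "CONTROL_CENTER_SERVICE_ID", "type": "service_reference"}],
    }
}

resp = requests.post("https://api.pagerduty.com/maintenance_windows",
                     json=window, headers=HEADERS, timeout=10)
resp.raise_for_status()
print("Window ID:", resp.json()["maintenance_window"]["id"])
```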
Template: Post-Mortem Report
Below is a detailed template for post-mortems, to be filled in PagerDuty or exported to Jira; a rendering sketch follows the template.
1. Incident Summary:
- Incident ID: [PagerDuty ID]
- Title: [Brief Description, e.g., "Control Center Latency Spike"]
- Severity: [P1-P5]
- Date/Time: [Start/End]
- Impacted Services: [e.g., Auth0 -> Control Center; Core Knowledge Graph unaffected]
2. Timeline:
- [Timestamp]: Detection (e.g., Sentry alert at 14:00).
- [Timestamp]: Acknowledgment (e.g., Sparky auto-ack at 14:02).
- [Timestamp]: Escalation (e.g., To L3 at 14:15).
- [Timestamp]: Resolution (e.g., Fix deployed at 14:30).
3. Root Cause Analysis:
- Description: [Detailed cause, e.g., "External Auth0 API latency due to their maintenance; no internal failure."]
- Contributing Factors: [e.g., Lack of caching in chain.]
- Service Chain Impact: [e.g., User-facing only; autonomous runbooks continued.]
4. Impact Assessment:
- Customers Affected: [Number/Names]
- Downtime: [Duration, e.g., 30 mins partial]
- Business Impact: [e.g., Delayed logins, but no data loss.]
5. Actions Taken:
- Immediate Fix: [e.g., Sparky rerouted traffic.]
- Long-Term: [e.g., Add redundancy for Auth0.]
6. Lessons Learned:
- [e.g., Improve chain monitoring for external deps.]
- Follow-Up Tasks: [Jira links, assigned owners, deadlines.]
7. Preventive Measures:
- [e.g., Integrate predictive alerts for external services via Cerebro.]
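For export, the template can be rendered into a markdown body and attached to the PagerDuty post-mortem or pasted into the Jira follow-up ticket. The sketch below is purely illustrative; section and field names mirror the template above, and missing values are left as TODOs.

```python
"""Minimal sketch: render the post-mortem template into a markdown body."""

TEMPLATE_SECTIONS = [
    ("Incident Summary", ["Incident ID", "Title", "Severity", "Date/Time", "Impacted Services"]),
    ("Timeline", ["Detection", "Acknowledgment", "Escalation", "Resolution"]),
    ("Root Cause Analysis", ["Description", "Contributing Factors", "Service Chain Impact"]),
    ("Impact Assessment", ["Customers Affected", "Downtime", "Business Impact"]),
    ("Actions Taken", ["Immediate Fix", "Long-Term"]),
    ("Lessons Learned", ["Findings", "Follow-Up Tasks"]),
    ("Preventive Measures", ["Measures"]),
]

def render_post_mortem(values):
    """values maps 'Section/Field' -> text; unfilled fields stay as TODOs."""
    lines = []
    for section, fields in TEMPLATE_SECTIONS:
        lines.append(f"## {section}")
        for field in fields:
            lines.append(f"- {field}: {values.get(f'{section}/{field}', 'TODO')}")
        lines.append("")
    return "\n".join(lines)

# print(render_post_mortem({"Incident Summary/Severity": "P2"}))
```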
Examples
- Post-Mortem: After a P2 incident from an AWS zone failure, the template captures how chaos testing prepared the team, with Sparky performing auto-failover.
- Maintenance: A scheduled window for quarterly pen testing suppresses alerts, allowing security work to proceed without noise.
Integration with Other Tools
- Sentry/New Relic: Event sources for alerts.
- Jira: Ticket creation for follow-ups.
- Sparky: Auto-triage low-severity incidents.
- AWS/Cloudflare: Monitor infrastructure events.
This guide ensures PagerDuty is used effectively for world-class operations.