Sentry Guide: Error Tracking and Performance Monitoring
1. Introduction: Capturing Every Exception
Sentry is our primary tool for real-time error tracking and performance monitoring at the application level. While New Relic provides infrastructure and APM metrics, Sentry gives us deep, code-level visibility into exceptions, crashes, and front-end performance issues. Every error, from a backend Python exception to a JavaScript error in a customer's browser, is captured and analyzed here.
This guide details how we use Sentry, how it integrates with Sparky for automated triage, and the best practices all engineers should follow.
Core Mission: To ensure that no error goes unnoticed, and that every exception is a learning opportunity that feeds back into our platform's resilience.
2. Usage in Handbook Sections
Sentry is a critical event source for many of our core operational workflows:
- Proactive Problem Resolution: Sentry errors are a primary trigger for Sparky's triage workflow. A spike in a particular exception can activate an L1 playbook or an L2 investigation.
- SRE and Monitoring/Availability: Sentry's error rates are a key SLI. A sudden increase in the error rate for a service will burn through our error budget and trigger a PagerDuty alert (see the burn-rate sketch after this list).
- Service Chain Monitoring: Sentry helps us pinpoint failures in the service chain, especially on the front-end (Control Center, Experience Center) where traditional APM has less visibility.
- Resilience and Testing: During chaos experiments, we monitor Sentry closely to see how our injected failures manifest as application-level errors.
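The error-budget arithmetic behind this alerting is simple enough to sketch. The snippet below is a minimal illustration only; the 99.9% SLO target and the 10x burn-rate threshold are assumed values for the example, not our actual alert configuration.

def burn_rate(error_rate: float, slo_target: float = 0.999) -> float:
    # error_rate: fraction of requests that errored in the window (e.g. from Sentry)
    # slo_target: availability target; 0.999 leaves a 0.1% error budget
    error_budget = 1.0 - slo_target
    return error_rate / error_budget

# A 1% error rate against a 99.9% SLO consumes the budget at 10x the sustainable
# rate, which is the kind of spike that pages the on-call engineer.
if burn_rate(0.01) >= 10:
    print("Burn rate too high: fire PagerDuty alert")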
3. Key Features and Best Practices
- Project Configuration: Every microservice and front-end application has its own Sentry project. This allows us to set ownership rules and alerts on a per-service basis.
- Ownership Rules: We use Sentry's ownership rules to automatically assign new issues to the correct team. This is configured in the codebase using a sentry.properties file.
- Release Health: We track the health of every new release. A deployment is only considered successful if the new release does not introduce new, high-frequency errors (see the initialization sketch after the Flask example below).
- Source Maps: For all front-end projects, we automatically upload source maps to Sentry during the build process. This is non-negotiable, as it allows us to see un-minified, readable stack traces.
- Context, Context, Context: We enrich every Sentry event with as much context as possible. All backend services should include the request_id, tenant_id, and user_id in the Sentry scope. This allows Sparky and human engineers to immediately correlate an error with a specific user journey or Knowledge Graph entity.
Code Example: Adding Context in Python (Flask)
from flask import Flask, request
from sentry_sdk import configure_scope

app = Flask(__name__)

@app.before_request
def before_request():
    # Assume we get user and tenant from a request header or session
    user = get_user_from_request()
    tenant = get_tenant_from_request()
    with configure_scope() as scope:
        scope.set_user({"id": user.id, "email": user.email})
        scope.set_tag("tenant_id", tenant.id)
        # This trace_id should correlate with our New Relic traces
        scope.set_tag("trace_id", request.headers.get('X-Trace-ID'))
4. Verbose Workflow: From Sentry Error to Sparky Fix
This workflow details how a simple code error is automatically detected, triaged, and resolved.
- Error Occurs: A NullPointerException occurs in the knowledge-graph-api service.
- Sentry Captures Event: The Sentry SDK in the service catches the unhandled exception and sends it to Sentry, along with all the context we added (user, tenant, trace_id).
- Ownership & Alerting: Sentry's rules determine this is owned by the #knowledge-graph-team and that it's a new, high-frequency issue.
- Fire Webhook: Sentry sends a detailed webhook payload to Sparky's Ingestor (a minimal handler sketch follows this workflow).
- Sparky Triage:
- L1 Check: Sparky queries the Knowledge Graph: "Does this error correlate with a known playbook?" The answer is no.
- L2 Analysis: Sparky passes the stack trace and context to Cerebro. Cerebro analyzes the code and the context: "The error happens because the user object can be null in this specific API call, which was recently modified." Cerebro generates a code fix: add a null check.
- Sparky Action: Sparky creates a Pull Request in GitHub with the proposed null check, assigning it to the on-call engineer for the Knowledge Graph team.
- Link Everything: Sparky then uses the Sentry, Jira, and GitHub APIs to link everything together. The Sentry issue is linked to the Jira ticket and the GitHub PR. An engineer can now see the full story in any of the three tools.
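To make the Fire Webhook and triage steps concrete, here is a minimal sketch of what Sparky's Ingestor endpoint could look like. The route, the payload field names, and the query_playbooks / run_playbook / escalate_to_cerebro helpers are all illustrative assumptions; the real payload shape depends on which Sentry webhook integration we have configured.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhooks/sentry", methods=["POST"])
def sentry_webhook():
    payload = request.get_json(force=True)

    # Field names below are illustrative; the real payload shape depends on
    # the Sentry webhook / integration in use.
    issue_title = payload.get("message", "unknown error")
    project = payload.get("project", "unknown-project")
    tags = dict(payload.get("event", {}).get("tags", []))

    # L1: is there a known playbook for this error in the Knowledge Graph?
    playbook = query_playbooks(project, issue_title)   # hypothetical helper
    if playbook:
        run_playbook(playbook, tags)                   # hypothetical helper
    else:
        # L2: hand the stack trace and context to Cerebro for analysis.
        escalate_to_cerebro(payload, tags)             # hypothetical helper
    return jsonify({"status": "accepted"}), 202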
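The Sparky Action step ultimately comes down to a call against GitHub's standard pull request endpoint. The sketch below shows the shape of that call; the repository slug, branch names, and token variable are assumptions, and in practice Sparky pushes the fix branch before opening the PR and records the Jira and Sentry links in the PR body.

import os
import requests

GITHUB_API = "https://api.github.com"
REPO = "our-org/knowledge-graph-api"   # assumed repository slug

def open_fix_pr(branch: str, sentry_issue_url: str, jira_key: str) -> str:
    # Open a PR for an automated fix and return its URL (illustrative sketch).
    resp = requests.post(
        f"{GITHUB_API}/repos/{REPO}/pulls",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={
            "title": "fix: add null check for user object",
            "head": branch,   # branch containing the proposed fix, already pushed
            "base": "main",
            # Linking back to Sentry and Jira keeps the full story in one place.
            "body": f"Automated fix proposed by Sparky.\n\nSentry: {sentry_issue_url}\nJira: {jira_key}",
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]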
5. Template: Sentry Issue Triage Checklist
When manually triaging a Sentry issue that Sparky has escalated (L3), follow this checklist:
- [ ] 1. Acknowledge and Assign: Assign the Sentry issue to yourself.
- [ ] 2. Review Sparky's Diagnostics: Read the linked Jira ticket created by Sparky. Does the analysis make sense?
- [ ] 3. Check for Duplicates: Is this a duplicate of an existing, known issue?
- [ ] 4. Assess the Impact: Use the tenant_id and user_id tags to understand the blast radius. Is this affecting one customer or all of them? (See the API sketch after this checklist.)
- [ ] 5. Analyze the Stack Trace: Use the source maps to pinpoint the exact line of code causing the error.
- [ ] 6. Reproduce the Issue: Can you reproduce the error in a local or staging environment?
- [ ] 7. Propose a Fix: Create a branch and open a PR with your proposed fix. Link the PR back to the Sentry issue.
- [ ] 8. Mark for Release: Once the fix is merged, mark the Sentry issue as "Resolved in next release".
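For step 4, the blast radius can often be read straight from the issue's tags via the Sentry API. The sketch below is illustrative only: the issue ID, token variable, and function name are placeholders, and the exact tag-values endpoint path may differ depending on our Sentry version.

import os
import requests

SENTRY_API = "https://sentry.io/api/0"

def tenant_blast_radius(issue_id: str) -> int:
    # Return how many distinct tenant_id values are attached to a Sentry issue (sketch).
    resp = requests.get(
        f"{SENTRY_API}/issues/{issue_id}/tags/tenant_id/values/",
        headers={"Authorization": f"Bearer {os.environ['SENTRY_AUTH_TOKEN']}"},
        timeout=10,
    )
    resp.raise_for_status()
    # Each entry describes one tag value and how many events carried it.
    return len(resp.json())

# One affected tenant suggests a customer-specific problem; many tenants
# suggest a platform-wide regression.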