
New Relic Guide: Our Single Pane of Glass for Observability

1. Introduction: Seeing the Whole System

New Relic is the cornerstone of our observability strategy. It is the single platform where we collect, visualize, and alert on the full spectrum of our telemetry: metrics, events, logs, and traces (MELT). From high-level SLO dashboards to deep, code-level transaction traces, New Relic provides the tools we need to understand our system's behavior in real time.

This guide details our best practices for using New Relic, including how to instrument applications, build dashboards, and leverage its data for troubleshooting.

Core Mission: To use New Relic to create a shared, data-driven understanding of our platform's health, making observability a part of every engineer's daily workflow.


2. Usage in Handbook Sections

New Relic is the data source for many of our most critical operational processes:

  • SRE and Monitoring/Availability: New Relic is where our SLIs are measured, our SLOs are tracked, and our error budget alerts are configured. The SRE Primary Health Dashboard lives here.
  • Performance Tuning: New Relic's APM and profiling tools are the starting point for any performance investigation.
  • Service Chain Monitoring: End-to-end distributed traces in New Relic allow us to visualize and troubleshoot the entire service chain.
  • Incident Management: When an incident occurs, the first place the Incident Commander looks is the relevant New Relic dashboard.
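
To make the error-budget idea concrete, here is a minimal sketch of the arithmetic behind an error budget alert. It assumes an availability SLO expressed as a fraction (for example 0.999) and request counts that would in practice come from a NRQL query; the function name and the numbers in the usage note are illustrative, not our configured values.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget that is still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been violated for the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_requests <= 0:
        return 1.0  # no traffic in the window, so no budget has been spent
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For example, with a 99.9% target, 1,000,000 requests, and 250 failures, roughly three quarters of the budget is still available. An alert would typically fire when this value drops below a chosen threshold.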

3. Key Features and Best Practices

  • Unified Agent: We run the New Relic infrastructure agent alongside the APM agent. Together they provide a unified view of infrastructure metrics (CPU and memory) and application performance.
  • Standardized Tagging: All New Relic data is tagged with serviceName, environment, awsRegion, and ownerTeam. This allows for easy filtering and grouping.
  • NRQL Everywhere: New Relic Query Language (NRQL) is the standard way to query our observability data. All engineers are encouraged to learn basic NRQL. Alerts, dashboard widgets, and custom reports are all built with NRQL.
  • Dashboards as Code: All critical team and service dashboards are defined as code using New Relic's Terraform provider. This ensures they are version-controlled, repeatable, and can be peer-reviewed.
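
The tagging standard above is easy to check mechanically. The following is a minimal sketch, assuming tags are available as a plain dictionary (for example, parsed from a deployment manifest); the function name is hypothetical and not part of any New Relic library.

```python
# The four tags required by the tagging standard.
REQUIRED_TAGS = ("serviceName", "environment", "awsRegion", "ownerTeam")


def missing_tags(tags: dict) -> list:
    """Return the required tag keys that are absent or have an empty value."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]
```

A check like `missing_tags({"serviceName": "my-service", "environment": "prod"})` returns `["awsRegion", "ownerTeam"]`, which could be used to fail a CI step before untagged telemetry reaches New Relic.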

Example NRQL Queries

| Use Case | NRQL Query |
| --- | --- |
| API Error Rate | SELECT percentage(count(*), WHERE http.statusCode >= 500) FROM Transaction WHERE appName = 'platform-api-gateway' TIMESERIES |
| Top 5 Slowest Transactions | SELECT average(duration) FROM Transaction WHERE appName = 'my-service' FACET name LIMIT 5 |
| CPU Usage by Pod | SELECT average(cpuCoresUtilization) FROM K8sContainerSample WHERE deploymentName = 'my-deployment' FACET podName TIMESERIES |
| Log Error Count | SELECT count(*) FROM Log WHERE level = 'ERROR' AND service_name = 'my-service' TIMESERIES |
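
Any of the queries above can also be run from a script through NerdGraph, New Relic's GraphQL API. The sketch below only builds the HTTP request and does not send it. It assumes the US endpoint and a user API key; the function name, the account ID, and the key shown in the usage note are placeholders.

```python
import json
import urllib.request

# US-region NerdGraph endpoint; EU accounts use a different host.
NERDGRAPH_URL = "https://api.newrelic.com/graphql"

# GraphQL document that runs a single NRQL query against one account.
GRAPHQL_QUERY = """
query($accountId: Int!, $nrql: Nrql!) {
  actor {
    account(id: $accountId) {
      nrql(query: $nrql) {
        results
      }
    }
  }
}
"""


def build_nrql_request(account_id: int, nrql: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a NerdGraph request that runs one NRQL query."""
    payload = {
        "query": GRAPHQL_QUERY,
        "variables": {"accountId": account_id, "nrql": nrql},
    }
    return urllib.request.Request(
        NERDGRAPH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "API-Key": api_key},
        method="POST",
    )
```

To execute the query, pass the result of `build_nrql_request(1234567, "SELECT count(*) FROM Transaction", "<user-api-key>")` to `urllib.request.urlopen` and read `data.actor.account.nrql.results` from the JSON response. Passing the NRQL text as a GraphQL variable avoids having to escape the quotes inside the query.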

4. Verbose Workflow: End-to-End Distributed Tracing

This workflow explains how to use New Relic to trace a slow request from the customer's browser all the way to the database.

  1. The Symptom: A customer reports that loading their dashboard in the Control Center is slow.
  2. Start at the Browser: In New Relic Browser, find the specific BrowserInteraction event for that user's session. You see that the initial page load is fast, but a specific AJAX request to /api/v1/dashboard-data is taking 5 seconds.
  3. Jump to the Backend Trace: The browser trace is automatically linked to the backend APM trace. Click the link to follow the request into our infrastructure.
  4. Analyze the Trace Waterfall: The distributed trace waterfall diagram shows the request's journey:
    • platform-api-gateway: 20ms
    • authentication-service: 50ms
    • dashboard-data-service: 4900ms (This is the bottleneck!)
    • knowledge-graph-api: 30ms (Called by dashboard-data-service)
  5. Drill into the Slow Service: Click on the dashboard-data-service span in the trace. The span details show that 4800ms of its 4900ms is spent in a single, complex database query.
  6. Find the Root Cause: The trace provides the full text of the slow query, the stack trace of the code that executed it, and a link to the database host's performance metrics at that exact time.
  7. Create an Actionable Ticket: You now have everything you need. You create a Jira ticket for the dashboard-data-service team with a link to the New Relic trace and the specific slow query that needs to be optimized. The entire investigation took less than 10 minutes.
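
The reasoning in step 4 can be sketched in code: a span's own ("self") time is its duration minus the time spent in the spans it calls, and the bottleneck is the span with the largest self time. This is an illustrative sketch, not how New Relic computes its waterfall. It assumes the durations listed in step 4, that only knowledge-graph-api is nested under dashboard-data-service, and that child spans run sequentially; the `Span` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    name: str
    duration_ms: float
    parent: Optional[str] = None  # name of the calling span, if any


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Return each span's duration minus the durations of its direct children."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent is not None:
            child_total[span.parent] = child_total.get(span.parent, 0.0) + span.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.name, 0.0) for s in spans}


def bottleneck(spans: List[Span]) -> str:
    """Return the name of the span with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)


# Durations taken from the waterfall in step 4.
WATERFALL = [
    Span("platform-api-gateway", 20.0),
    Span("authentication-service", 50.0),
    Span("dashboard-data-service", 4900.0),
    Span("knowledge-graph-api", 30.0, parent="dashboard-data-service"),
]
```

Here `bottleneck(WATERFALL)` returns `"dashboard-data-service"`, whose self time is 4870ms once the 30ms call to knowledge-graph-api is subtracted. That matches the conclusion in step 5 that nearly all of the time is spent inside the service itself.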

5. Template: Standard Service Dashboard JSON

This JSON blob is the "Dashboard as Code" template for a standard microservice, managed via Terraform. It can also be imported directly into the New Relic UI.

{
  "name": "TEMPLATE - My Service Health",
  "permissions": "PUBLIC_READ_WRITE",
  "pages": [
    {
      "name": "Overview",
      "widgets": [
        {
          "title": "API Throughput (Requests per Minute)",
          "visualization": "billboard",
          "nrqlQueries": [
            { "query": "SELECT rate(count(*), 1 minute) FROM Transaction WHERE appName = 'my-service'" }
          ]
        },
        {
          "title": "API Error Rate (%)",
          "visualization": "gauge",
          "nrqlQueries": [
            { "query": "SELECT percentage(count(*), WHERE http.statusCode >= 500) FROM Transaction WHERE appName = 'my-service'" }
          ],
          "thresholds": [ { "value": 1, "alertSeverity": "CRITICAL" } ]
        },
        {
          "title": "API Latency (p95)",
          "visualization": "billboard",
          "nrqlQueries": [
            { "query": "SELECT percentile(duration, 95) FROM Transaction WHERE appName = 'my-service'" }
          ]
        },
        {
          "title": "CPU Utilization (%)",
          "visualization": "line_chart",
          "nrqlQueries": [
            { "query": "SELECT average(cpuCoresUtilization) FROM K8sContainerSample WHERE deploymentName = 'my-service-deployment' TIMESERIES" }
          ]
        },
        {
          "title": "Memory Utilization (MiB)",
          "visualization": "line_chart",
          "nrqlQueries": [
            { "query": "SELECT average(memoryWorkingSetBytes) / 1048576 FROM K8sContainerSample WHERE deploymentName = 'my-service-deployment' TIMESERIES" }
          ]
        }
      ]
    }
  ]
}
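
Before the template is applied, the 'my-service' placeholder has to be replaced with a real service name. The sketch below shows one way to do that, using a one-widget excerpt of the template so the example stays self-contained. The function name, the naming rule enforced by the regular expression, and the generated dashboard title are assumptions for illustration, not an established convention.

```python
import json
import re

# One-widget excerpt of the template above, kept short for the example.
TEMPLATE = {
    "name": "TEMPLATE - My Service Health",
    "pages": [
        {
            "name": "Overview",
            "widgets": [
                {
                    "title": "API Latency (p95)",
                    "visualization": "billboard",
                    "nrqlQueries": [
                        {"query": "SELECT percentile(duration, 95) FROM Transaction WHERE appName = 'my-service'"}
                    ],
                }
            ],
        }
    ],
}


def render_dashboard(template: dict, service_name: str) -> dict:
    """Return a copy of the template with 'my-service' replaced by service_name."""
    # Assumed naming rule: lowercase letters, digits, and hyphens only.
    # This also keeps quotes out of the generated NRQL.
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", service_name):
        raise ValueError(f"unexpected service name: {service_name!r}")
    # Round-tripping through JSON copies the template and lets one string
    # replace cover every query, including 'my-service-deployment'.
    rendered = json.loads(json.dumps(template).replace("my-service", service_name))
    rendered["name"] = f"{service_name} - Service Health"
    return rendered
```

Calling `render_dashboard(TEMPLATE, "checkout")` leaves `TEMPLATE` untouched and returns a dashboard whose queries filter on `appName = 'checkout'`. The result can be written out with `json.dumps` for Terraform to consume.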