New Relic Guide: Our Single Pane of Glass for Observability
1. Introduction: Seeing the Whole System
New Relic is the cornerstone of our observability strategy. It is the single platform where we collect, visualize, and alert on the full spectrum of our telemetry: metrics, events, logs, and traces (MELT). From high-level SLO dashboards to deep, code-level transaction traces, New Relic provides the tools we need to understand our system's behavior in real time.
This guide details our best practices for using New Relic, including how to instrument applications, build dashboards, and leverage its data for troubleshooting.
Core Mission: To use New Relic to create a shared, data-driven understanding of our platform's health, making observability a part of every engineer's daily workflow.
2. Usage in Handbook Sections
New Relic is the data source for many of our most critical operational processes:
- SRE and Monitoring/Availability: New Relic is where our SLIs are measured, our SLOs are tracked, and our error budget alerts are configured. The SRE Primary Health Dashboard lives here.
- Performance Tuning: New Relic's APM and profiling tools are the starting point for any performance investigation.
- Service Chain Monitoring: End-to-end distributed traces in New Relic allow us to visualize and troubleshoot the entire service chain.
- Incident Management: When an incident occurs, the first place the Incident Commander looks is the relevant New Relic dashboard.
3. Key Features and Best Practices
- Unified Agent: We run the New Relic infrastructure agent together with the APM agent. This provides a unified view of both infrastructure metrics (CPU and memory) and application performance.
- Standardized Tagging: All New Relic data is tagged with `serviceName`, `environment`, `awsRegion`, and `ownerTeam`. This allows for easy filtering and grouping.
- NRQL Everywhere: New Relic Query Language (NRQL) is the standard way to query our observability data. All engineers are encouraged to learn basic NRQL. Alerts, dashboard widgets, and custom reports are all built with NRQL.
- Dashboards as Code: All critical team and service dashboards are defined as code using New Relic's Terraform provider. This ensures they are version-controlled, repeatable, and can be peer-reviewed.
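The tagging standard above lends itself to an automated check, for example in CI before a service is deployed. The following is a minimal sketch, not an existing tool: the names `REQUIRED_TAGS` and `missing_tags` and the example tag values are hypothetical, and only the four tag keys come from this guide.

```python
from typing import Dict, List

# The four tag keys required by the tagging standard in this guide.
REQUIRED_TAGS = ("serviceName", "environment", "awsRegion", "ownerTeam")


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """Return the required tag keys that are absent or blank, in standard order."""
    missing = []
    for key in REQUIRED_TAGS:
        value = tags.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    return missing


# Hypothetical example: a service that forgot to declare its owning team.
example = {
    "serviceName": "dashboard-data-service",
    "environment": "production",
    "awsRegion": "us-east-1",
}
print(missing_tags(example))
```

An empty result means the tag set satisfies the standard; anything else lists the keys that still need to be added.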
Example NRQL Queries
| Use Case | NRQL Query |
|---|---|
| API Error Rate | SELECT percentage(count(*), WHERE http.statusCode >= 500) FROM Transaction WHERE appName = 'platform-api-gateway' TIMESERIES |
| Top 5 Slowest Transactions | SELECT average(duration) FROM Transaction WHERE appName = 'my-service' FACET name LIMIT 5 |
| CPU Usage by Pod | SELECT average(cpuPercent) FROM K8sContainerSample WHERE deploymentName = 'my-deployment' FACET podName TIMESERIES |
| Log Error Count | SELECT count(*) FROM Log WHERE level = 'ERROR' AND service_name = 'my-service' TIMESERIES |
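When these queries are generated from scripts rather than typed by hand, it helps to build them from a single function so the service name is quoted consistently. The sketch below reproduces the API Error Rate query from the table; the helper names `nrql_string` and `error_rate_query` are hypothetical, and the backslash escaping of single quotes is an assumption about NRQL string literals that should be verified against the NRQL reference before use.

```python
def nrql_string(value: str) -> str:
    """Quote a value as an NRQL string literal (assumes backslash escaping)."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def error_rate_query(app_name: str) -> str:
    """Build the API Error Rate query from the table for a given appName."""
    return (
        "SELECT percentage(count(*), WHERE http.statusCode >= 500) "
        f"FROM Transaction WHERE appName = {nrql_string(app_name)} TIMESERIES"
    )


print(error_rate_query("platform-api-gateway"))
```

The printed string can be pasted into the query builder or used as the query of a dashboard widget.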
4. Verbose Workflow: End-to-End Distributed Tracing
This workflow explains how to use New Relic to trace a slow request from the customer's browser all the way to the database.
- The Symptom: A customer reports that loading their dashboard in the Control Center is slow.
- Start at the Browser: In New Relic Browser, find the specific `BrowserInteraction` event for that user's session. You see that the initial page load is fast, but a specific AJAX request to `/api/v1/dashboard-data` is taking 5 seconds.
- Jump to the Backend Trace: The browser trace is automatically linked to the backend APM trace. Click the link to follow the request into our infrastructure.
- Analyze the Trace Waterfall: The distributed trace waterfall diagram shows the request's journey:
  - `platform-api-gateway`: 20ms
  - `authentication-service`: 50ms
  - `dashboard-data-service`: 4900ms (This is the bottleneck!)
  - `knowledge-graph-api`: 30ms (Called by `dashboard-data-service`)
- Drill into the Slow Service: Click on the `dashboard-data-service` span in the trace. The trace details show that 4800ms of the time is spent in a single, complex database query.
- Find the Root Cause: The trace provides the full text of the slow query, the stack trace of the code that executed it, and a link to the database host's performance metrics at that exact time.
- Create an Actionable Ticket: You now have everything you need. You create a Jira ticket for the `dashboard-data-service` team with a link to the New Relic trace and the specific slow query that needs to be optimized. The entire investigation took less than 10 minutes.
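The reasoning in the drill-down step is a self-time calculation: a span's own cost is its duration minus the time spent in its children. The sketch below applies that to the numbers from this workflow. It is an illustration only: the `Span` type, the `self_times` function, the `database query` span name, and the assumption that both the `knowledge-graph-api` call and the query are direct children of `dashboard-data-service` are all hypothetical, and real trace data would come from New Relic rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    name: str
    parent: Optional[str]  # name of the parent span, or None for the subtree root
    duration_ms: int


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Return each span's duration minus the total duration of its direct children."""
    child_total: Dict[str, int] = {}
    for span in spans:
        if span.parent is not None:
            child_total[span.parent] = child_total.get(span.parent, 0) + span.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.name, 0) for s in spans}


# Durations taken from the workflow above; the parent links are assumed.
spans = [
    Span("dashboard-data-service", None, 4900),
    Span("knowledge-graph-api", "dashboard-data-service", 30),
    Span("database query", "dashboard-data-service", 4800),
]

times = self_times(spans)
bottleneck = max(times, key=times.get)
print(bottleneck, times[bottleneck])
```

Under these assumptions the service's own code accounts for only 70ms of the 4900ms, which is why the ticket targets the query rather than the service logic.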
5. Template: Standard Service Dashboard JSON
This JSON blob is the "Dashboard as Code" template for a standard microservice, managed via Terraform. It can also be imported directly into the New Relic UI.
{
  "name": "TEMPLATE - My Service Health",
  "permissions": "PUBLIC_READ_WRITE",
  "pages": [
    {
      "name": "Overview",
      "widgets": [
        {
          "title": "API Throughput (Requests per Minute)",
          "visualization": "billboard",
          "nrqlQueries": [
            { "query": "SELECT rate(count(*), 1 minute) FROM Transaction WHERE appName = 'my-service'" }
          ]
        },
        {
          "title": "API Error Rate (%)",
          "visualization": "gauge",
          "nrqlQueries": [
            { "query": "SELECT percentage(count(*), WHERE http.statusCode >= 500) FROM Transaction WHERE appName = 'my-service'" }
          ],
          "thresholds": [ { "value": 1, "alertSeverity": "CRITICAL" } ]
        },
        {
          "title": "API Latency (p95)",
          "visualization": "billboard",
          "nrqlQueries": [
            { "query": "SELECT percentile(duration, 95) FROM Transaction WHERE appName = 'my-service'" }
          ]
        },
        {
          "title": "CPU Utilization (%)",
          "visualization": "line_chart",
          "nrqlQueries": [
            { "query": "SELECT average(cpuPercent) FROM K8sContainerSample WHERE deploymentName = 'my-service-deployment' TIMESERIES" }
          ]
        },
        {
          "title": "Memory Utilization (MiB)",
          "visualization": "line_chart",
          "nrqlQueries": [
            { "query": "SELECT average(memoryWorkingSetBytes) / 1048576 FROM K8sContainerSample WHERE deploymentName = 'my-service-deployment' TIMESERIES" }
          ]
        }
      ]
    }
  ]
}
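To adapt the template for a real service, the `'my-service'` and `'my-service-deployment'` placeholders need to be replaced in every widget query. The sketch below shows one way to do that on a trimmed copy of the template. It is a hypothetical helper, not part of our Terraform setup: the function name `render_dashboard`, the resulting dashboard name pattern, and the example service and deployment names are assumptions.

```python
import copy
import json

# Trimmed copy of the template above: two widgets are enough to show the idea.
TEMPLATE = {
    "name": "TEMPLATE - My Service Health",
    "pages": [{
        "name": "Overview",
        "widgets": [
            {"title": "API Latency (p95)", "nrqlQueries": [
                {"query": "SELECT percentile(duration, 95) FROM Transaction "
                          "WHERE appName = 'my-service'"}]},
            {"title": "CPU Utilization (%)", "nrqlQueries": [
                {"query": "SELECT average(cpuPercent) FROM K8sContainerSample "
                          "WHERE deploymentName = 'my-service-deployment' TIMESERIES"}]},
        ],
    }],
}


def render_dashboard(template: dict, service_name: str, deployment_name: str) -> dict:
    """Return a copy of the template with the placeholder names substituted."""
    dashboard = copy.deepcopy(template)  # leave the shared template untouched
    dashboard["name"] = f"{service_name} - Service Health"  # assumed naming pattern
    for page in dashboard["pages"]:
        for widget in page["widgets"]:
            for entry in widget["nrqlQueries"]:
                # Match the quoted placeholders so each one is replaced exactly.
                entry["query"] = (
                    entry["query"]
                    .replace("'my-service-deployment'", f"'{deployment_name}'")
                    .replace("'my-service'", f"'{service_name}'")
                )
    return dashboard


rendered = render_dashboard(
    TEMPLATE, "dashboard-data-service", "dashboard-data-service-deployment"
)
print(json.dumps(rendered, indent=2))
```

Because the function works on a deep copy, the same template can be rendered for many services in one run.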