Monitoring Framework for the XOPS Architecture
1. Introduction: A Deeply Observable System
The XOPS platform is a complex, distributed system composed of many interacting services. To operate it reliably, we must have deep visibility into the health and performance of every single component. This is not just about collecting data; it's about building a cohesive, end-to-end picture of our system's behavior in real time.
This document details our framework for monitoring the XOPS architecture. It specifies what we monitor, how we collect the telemetry, and how that data is used to ensure stability and performance.
Core Mission: To make our entire architecture a "glass box," where the internal state of every component is observable, understandable, and actionable.
2. The Telemetry Flow: From Source to Insight
All telemetry data follows a standardized path, ensuring that we can correlate information from different sources to get a complete picture of any event.
What We Collect: The Three Pillars of Observability
We standardize on the "three pillars" of observability for all components.
1. Metrics (The Numbers)
- What they are: Aggregated, numerical data about the performance of a system over time.
- Examples:
- Application Metrics: API request rate, error rate (`http.statusCode >= 500`), latency percentiles (p50, p90, p95, p99) for specific API endpoints. Examples include `SELECT rate(count(*), 1 minute) FROM Transaction WHERE appName = 'my-service'` and `SELECT percentile(duration, 95) FROM Transaction WHERE appName = 'my-service'`.
- Infrastructure Metrics: Pod CPU/Memory utilization, disk I/O, network bandwidth per pod/node, Kubernetes ingress/egress traffic, pod restarts. Use queries like `SELECT average(cpuPercent) FROM K8sContainerSample WHERE deploymentName = 'my-service-deployment'`.
- Knowledge Graph Metrics: Query execution times (average, p95), ingestion rates (events/sec), graph query complexity scores, data freshness (time since last update for key entities), connection pool utilization. Examples: `SELECT avg(query_time_ms) FROM KnowledgeGraphQueries` and `SELECT rate(count(*), 1 minute) FROM KgIngestionEvents`.
- Cerebro/Sparky Metrics: Model inference latency, prediction success rates, number of automated resolutions initiated/completed, feedback loop latency.
- Business Metrics: Number of active users, integrations configured per tenant, runbooks executed per incident.
- How we collect them: Services expose metrics via Prometheus-compatible endpoints. The New Relic infrastructure agent scrapes these, and application-specific metrics are collected via APM agents.
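For illustration, the sketch below shows one way a service could expose a Prometheus-compatible /metrics endpoint for an agent to scrape. It is a minimal Python example using the prometheus_client library; the metric names, labels, and scrape port are placeholders rather than XOPS conventions.

```python
# Minimal sketch: exposing Prometheus-compatible application metrics.
# Assumes the prometheus_client library; metric names, labels, and the
# scrape port (8000) are illustrative, not XOPS conventions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Request counter labelled by endpoint and status code (feeds rate/error-rate queries).
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests", ["endpoint", "status_code"]
)
# Latency histogram per endpoint (feeds p50/p90/p95/p99 percentile queries).
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds", ["endpoint"]
)


def handle_request(endpoint: str) -> None:
    """Simulate handling a request and record metrics about it."""
    start = time.monotonic()
    time.sleep(random.uniform(0.01, 0.1))            # stand-in for real work
    status = 500 if random.random() < 0.02 else 200  # pretend ~2% of requests fail
    LATENCY.labels(endpoint=endpoint).observe(time.monotonic() - start)
    REQUESTS.labels(endpoint=endpoint, status_code=str(status)).inc()


if __name__ == "__main__":
    start_http_server(8000)  # scrape target at http://localhost:8000/metrics
    while True:
        handle_request("/api/v1/incidents")
```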
2. Logs (The Story)
- What they are: Timestamped, structured text records of discrete events, errors, and informational messages.
- Standard Log Format: All services must log in a structured JSON format. Each log entry must include:
- `timestamp`: ISO 8601 format (e.g., `2026-01-11T10:00:00.123Z`).
- `level`: e.g., `INFO`, `WARN`, `ERROR`, `DEBUG`.
- `service_name`: Name of the originating service.
- `trace_id`: Crucial for correlating logs with metrics and traces across services.
- `span_id`: For tracing within a specific service's operations.
- `message`: The log event description.
- `customer_id` / `tenant_id`: If applicable, for customer-specific logging.
- `deployment_id`: To link logs to a specific release version.
- `host`: The node or pod name where the service is running.
- Example Log Entry:
{
"timestamp": "2026-01-11T10:00:00.123Z",
"level": "ERROR",
"service_name": "knowledge-graph-api",
"trace_id": "trace-abc123def456",
"span_id": "span-789ghi012",
"message": "Failed to connect to database",
"db.host": "db.prod.internal",
"db.error": "connection refused",
"customer_id": "cust_xyz789",
"deployment_id": "deploy-v2.5.1-7d8a9b0"
}
- How we collect them: Logs are sent to the central Observability API Gateway (MCP) for initial processing and enrichment, which then forwards them to New Relic Logs.
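As a minimal illustration of emitting logs in this format, the Python sketch below builds the required fields using only the standard library. The service name, deployment ID, and correlation IDs are placeholders; in a real service the trace_id and span_id would come from the active trace context.

```python
# Minimal sketch: structured JSON logging with the standard XOPS fields.
# Standard library only; service name, deployment_id, and IDs are placeholders.
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc)
            .isoformat(timespec="milliseconds")
            .replace("+00:00", "Z"),
            "level": record.levelname,
            "service_name": "knowledge-graph-api",       # placeholder
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "message": record.getMessage(),
            "deployment_id": "deploy-v2.5.1-7d8a9b0",     # placeholder
            "host": getattr(record, "host", None),
        }
        # Drop fields that were not supplied for this event.
        return json.dumps({k: v for k, v in entry.items() if v is not None})


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("knowledge-graph-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` carries per-request correlation fields onto the log record.
logger.error(
    "Failed to connect to database",
    extra={"trace_id": "trace-abc123def456", "span_id": "span-789ghi012"},
)
```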
3. Traces (The Journey)
- What they are: A detailed record of a single request's lifecycle as it traverses multiple services. Traces capture the timing and dependencies of distributed operations.
- Examples: A trace for a user logging into Control Center might start at Cloudflare (edge), hit the MCP API Gateway, go to Auth0 (external IDP), then to the Authentication Service, then to the Knowledge Graph API (via GraphQL), and finally to the graph database.
- How we collect them: We use OpenTelemetry SDKs embedded in all our services. `trace_id` and `span_id` are propagated across service boundaries via HTTP headers (e.g., `traceparent`). Traces are sent to the Observability API Gateway (MCP) and then to New Relic APM.
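The sketch below shows the propagation mechanics with the OpenTelemetry Python SDK: start a span, then inject the current context into outgoing HTTP headers as `traceparent`. A console exporter stands in for the OTLP export to the Observability API Gateway, and the service name and attributes are placeholders.

```python
# Minimal sketch: creating a span and propagating the W3C traceparent header.
# A console exporter stands in for the OTLP export that would point at the
# Observability API Gateway (MCP); names and attributes are placeholders.
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "knowledge-graph-api"})
)
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("xops.example")

with tracer.start_as_current_span("handle_login_request") as span:
    span.set_attribute("customer_id", "cust_xyz789")  # illustrative attribute

    # Outgoing call to a downstream service: inject the current trace context
    # so trace_id/span_id cross the service boundary via the traceparent header.
    headers: dict[str, str] = {}
    inject(headers)
    print("outgoing headers:", headers)  # e.g. {'traceparent': '00-<trace_id>-<span_id>-01'}
```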
3. The MCP (API Gateway Facade) and Knowledge Graph GraphQL
Our architecture includes an MCP (Management Control Plane) API Gateway that serves as the primary ingress point for client interactions, routing requests to downstream services, including the Knowledge Graph's GraphQL endpoint.
Management Control Plane (MCP) API Gateway
- Role: The MCP acts as a secure, high-performance facade, centralizing common concerns before requests reach backend services.
- Key Functions:
- Authentication & Authorization: Validates API keys/tokens, enforces access control policies for all incoming requests.
- Rate Limiting & Throttling: Protects backend services from abuse, ensuring fair usage and stability. Configured based on service criticality and customer tiers.
- Request Routing: Intelligently directs traffic to appropriate backend services based on request path, headers, or query parameters. This includes routing GraphQL requests to the KG endpoint and REST requests to other platform services.
- TLS Termination: Handles SSL/TLS encryption and decryption.
- Telemetry Ingestion: Acts as the central hub for receiving metrics, logs, and traces from all services before forwarding them to New Relic and the Knowledge Graph.
- Monitoring: The MCP itself is a critical service and is heavily monitored in New Relic for latency, error rates, request volume, and connection counts. Alerts on MCP issues are P1 due to its broad impact.
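To make the facade pattern concrete, the Python sketch below chains simplified versions of three of the key functions above (authentication, rate limiting, request routing) in front of a hypothetical route table. It is illustrative only and is not the MCP implementation; the token check, limits, and backend names are assumptions.

```python
# Illustrative sketch only: a toy facade chaining auth, rate limiting, and
# routing, in the spirit of the MCP's "centralize common concerns" role.
# The token check, limits, and route table are all assumptions.
import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Simple token-bucket rate limiter (one bucket per API key)."""
    rate: float                 # tokens added per second
    capacity: float             # burst size
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


ROUTES = {  # path prefix -> backend service (hypothetical)
    "/graphql": "knowledge-graph-api",
    "/api/v1": "platform-rest-api",
}
BUCKETS: dict[str, TokenBucket] = {}


def handle(path: str, api_key: str) -> str:
    # 1. Authentication & authorization (stand-in check).
    if not api_key.startswith("key_"):
        return "401 Unauthorized"
    # 2. Rate limiting per API key (e.g., 10 req/s with a burst of 20).
    bucket = BUCKETS.setdefault(api_key, TokenBucket(rate=10, capacity=20, tokens=20))
    if not bucket.allow():
        return "429 Too Many Requests"
    # 3. Request routing by path prefix.
    for prefix, backend in ROUTES.items():
        if path.startswith(prefix):
            return f"routed to {backend}"
    return "404 Not Found"


print(handle("/graphql", "key_demo"))
```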
Knowledge Graph GraphQL Endpoint
- Role: Provides a flexible, efficient, and strongly-typed API for querying and manipulating data within the Knowledge Graph. It offers a more granular and efficient data retrieval mechanism compared to traditional REST APIs for complex data relationships.
- Usage:
- Internal Services: Cerebro, Sparky, and the Autonomous Engine use GraphQL to fetch rich contextual data and to update entities.
- External Clients: For specific customer-facing dashboards or partner integrations, a secure GraphQL API endpoint is exposed, routed via the MCP.
- Postman is used for manual exploration and automated testing of GraphQL queries and mutations.
- Monitoring:
- New Relic APM: The GraphQL service is instrumented to capture metrics such as query latency (per query type), error rates (e.g., GraphQL errors vs. 5xx HTTP errors), overall throughput, and request complexity scores.
- Query Analysis: We monitor for excessively complex or resource-intensive GraphQL queries. Cerebro can analyze query patterns to identify candidates for optimization or to inform schema design.
- Health Checks: A dedicated `/graphql/health` endpoint verifies the service's health and its ability to connect to the underlying graph database and execute a simple introspection query.
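As a minimal sketch of a deep health check against the GraphQL endpoint, the Python snippet below posts a trivial introspection-style query and verifies the response shape. The endpoint URL and credential are placeholders, and the probe query stands in for the platform's own health query.

```python
# Minimal sketch: a deep health check against the KG GraphQL endpoint.
# The URL and API key are placeholders; the query is a trivial
# introspection-style probe standing in for the platform's own check.
import requests

GRAPHQL_URL = "https://mcp.example.internal/graphql"  # placeholder
HEALTH_QUERY = "query { __typename }"                 # any GraphQL server can answer this


def check_graphql_health(timeout_s: float = 5.0) -> bool:
    try:
        resp = requests.post(
            GRAPHQL_URL,
            json={"query": HEALTH_QUERY},
            headers={"Authorization": "Bearer <api-key>"},  # placeholder credential
            timeout=timeout_s,
        )
    except requests.RequestException:
        return False
    if resp.status_code != 200:
        return False
    body = resp.json()
    # A healthy endpoint returns a `data` object and no top-level `errors`.
    return "data" in body and not body.get("errors")


if __name__ == "__main__":
    print("GraphQL healthy:", check_graphql_health())
```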
4. Health Checks: The Pulse of Our System
Health checks are the primary mechanism by which we determine if a service is "up" or "down" and ready to serve traffic. We use several types.
| Health Check Type | Description | How it's Used | Example |
|---|---|---|---|
| Liveness Probe | A simple check to see if the application process is running and responsive. | Kubernetes uses this to know when to restart a container. A failed liveness probe means the pod is broken and needs to be killed. | An HTTP endpoint (/healthz) that returns 200 OK if the web server process is running and not in a crash loop. |
| Readiness Probe | A more comprehensive check to see if the application is ready to serve traffic. | Kubernetes uses this to know whether to include a pod in a service's load balancer. A failed readiness probe means the pod is temporarily unable to serve requests but can potentially recover. | An HTTP endpoint (/readyz) that returns 200 OK only if the service can successfully connect to its primary database, establish a connection to the Knowledge Graph GraphQL endpoint (if it's a dependency), and has completed any necessary startup initialization. |
| Deep Health Check | A transactional check that validates critical functionality of a service. | These are run periodically by a dedicated monitoring service (e.g., New Relic Synthetics) or via Postman Collections in CI/CD. A failure triggers an alert. | - Knowledge Graph API (GraphQL): Execute a simple, read-only GraphQL introspection query (e.g., query { _service { id } }) and verify a successful response structure. This checks GraphQL endpoint health and basic graph connectivity. - Observability API: Send a small, dummy metric/log event and verify it appears in New Relic Logs within 60 seconds. - MCP API Gateway: Perform a health check against an internal service routed through the MCP and verify the response time is within acceptable limits. |
| Inter-Component Heartbeats | A push-based model where a service periodically reports its health to a central system. | Services like Cerebro and the Autonomous Engine send a periodic heartbeat signal to the MCP API Gateway. If a heartbeat is missed for a defined period (e.g., 2 missed heartbeats), the MCP marks the service as unhealthy and can trigger alerts or further automated actions. | The Autonomous Engine sends a POST request to /api/v1/heartbeat with its current status, version, and telemetry endpoint URL every 30 seconds. |
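The sketch below illustrates the push-based heartbeat from the last row of the table: a service POSTs its status to /api/v1/heartbeat every 30 seconds. The MCP base URL, payload field names, and status values are illustrative assumptions beyond what the table specifies.

```python
# Minimal sketch: an inter-component heartbeat sender (push model).
# The path and 30-second interval follow the table above; the base URL,
# payload field names, and status values are illustrative assumptions.
import time

import requests

MCP_HEARTBEAT_URL = "https://mcp.example.internal/api/v1/heartbeat"  # placeholder host
HEARTBEAT_INTERVAL_S = 30


def send_heartbeat() -> None:
    payload = {
        "service_name": "autonomous-engine",
        "status": "healthy",
        "version": "v2.5.1",                                            # placeholder
        "telemetry_endpoint": "https://autonomous-engine.internal/metrics",  # placeholder
    }
    try:
        requests.post(MCP_HEARTBEAT_URL, json=payload, timeout=5)
    except requests.RequestException:
        # A failed send is itself a signal: after 2 missed heartbeats the MCP
        # marks the service unhealthy, so we just note it locally and retry.
        print("heartbeat delivery failed; will retry next interval")


if __name__ == "__main__":
    while True:
        send_heartbeat()
        time.sleep(HEARTBEAT_INTERVAL_S)
```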
5. End-to-End Monitoring Workflow in Action
- Collection: Every component in the architecture (Applications, Platform Services like the MCP and KG, Infrastructure) is instrumented with OpenTelemetry and configured to emit telemetry (Metrics, Logs, Traces) to the central Observability API Gateway (MCP). External dependencies also report their health via scheduled health checks and synthetic monitors.
- Processing & Storage:
- The Observability API Gateway forwards this telemetry to New Relic for real-time visualization, alerting, and APM tracing.
- Simultaneously, logs and metrics are processed and sent to the Knowledge Graph, where they are structured into the silver layer and then enriched with business context and AI insights to form the gold layer. GraphQL queries are the primary way to access this rich, structured data.
- Analysis & Alerting:
- New Relic analyzes the real-time stream for threshold breaches on our key SLIs (e.g., latency, error rates). If an issue is found, it triggers an alert in PagerDuty.
- Cerebro continuously analyzes the historical, augmented data in the Knowledge Graph to identify subtle patterns, detect anomalies, and predict future failures. If a predictive pattern is found, Cerebro triggers a proactive alert.
- Triage and Resolution:
- Sparky receives the alert from either New Relic or Cerebro.
- It uses the `trace_id`, `service_name`, and `customer_id` from the alert to query the Knowledge Graph for all related telemetry (logs, metrics, traces) for that specific event and affected customer.
- This complete, correlated picture allows Sparky to perform its L1/L2/L3 triage with high accuracy, leading to faster and safer resolutions.
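As a purely illustrative sketch of this triage step, the snippet below turns an alert's correlation fields into a Knowledge Graph GraphQL query. The schema used here (a telemetryForTrace field and its sub-fields) is hypothetical and only stands in for whatever the KG actually exposes; the endpoint URL and credential are placeholders.

```python
# Illustrative sketch only: turning an alert's correlation fields into a
# Knowledge Graph query, as Sparky's triage step does conceptually.
# The GraphQL schema (telemetryForTrace and its fields) is hypothetical.
import requests

KG_GRAPHQL_URL = "https://mcp.example.internal/graphql"  # placeholder

CORRELATION_QUERY = """
query ($traceId: ID!, $customerId: ID!) {
  telemetryForTrace(traceId: $traceId, customerId: $customerId) {
    logs { timestamp level service_name message }
    metrics { name value timestamp }
    spans { service_name duration_ms status }
  }
}
"""


def fetch_related_telemetry(alert: dict) -> dict:
    """Query the KG for everything correlated with one alert."""
    variables = {"traceId": alert["trace_id"], "customerId": alert["customer_id"]}
    resp = requests.post(
        KG_GRAPHQL_URL,
        json={"query": CORRELATION_QUERY, "variables": variables},
        headers={"Authorization": "Bearer <api-key>"},  # placeholder credential
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["data"]


# Example alert payload shape (fields taken from the workflow above).
alert = {
    "trace_id": "trace-abc123def456",
    "service_name": "knowledge-graph-api",
    "customer_id": "cust_xyz789",
}
# telemetry = fetch_related_telemetry(alert)  # requires a live endpoint
```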