AWS Guide: Powering the Cerebro Cognitive Platform
1. Introduction: The Foundation for Our Intelligence
While the XOPS platform is designed to be cloud-agnostic at the infrastructure level (running on Kubernetes), our advanced AI capabilities are powered by a curated set of best-in-class services from Amazon Web Services (AWS). AWS provides the foundational models, training infrastructure, and inferencing engines for our cognitive platform, Cerebro.
This guide focuses specifically on how we leverage AWS for AI, not on our general cloud infrastructure usage (which is covered in other sections like SRE and Platform Operations). It details our use of services like Amazon Bedrock and Amazon SageMaker.
Core Mission: To use AWS's powerful AI services to build, train, and serve the machine learning models that make Cerebro, and by extension Sparky, intelligent, predictive, and effective.
2. Usage in Handbook Sections
Our AWS AI stack is a critical enabler for several key processes:
- Proactive Problem Resolution: Cerebro uses models hosted on SageMaker and Bedrock to predict failures before they occur.
- Workflows and Sparky Integration: Sparky consults Cerebro (and thus AWS) to get recommendations for L2 AI-generated fixes.
- Performance Tuning: Cerebro analyzes performance data using models trained on SageMaker to identify optimization opportunities.
- Customer Success and QBRs: Insights for QBRs are generated by Cerebro by processing customer data with models that run on AWS.
- Service Chain Monitoring: Anomaly detection within the service chain is performed by ML models running on our AWS stack.
3. Architectural Overview: Bedrock vs. SageMaker
We use both Bedrock and SageMaker, and it's crucial to understand the distinction and when to use each.
Amazon Bedrock: For Access to Foundation Models
- What it is: A fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies (like Anthropic, Meta, etc.) via a single API.
- When we use it:
  - For general-purpose large language model (LLM) tasks.
  - Generating human-readable explanations for complex issues.
  - Code generation for Sparky's L2 fixes.
  - Summarizing incident data for post-mortems.
- Why we use it: It abstracts away the complexity of hosting and scaling large, expensive FMs. We can switch between models like Anthropic's Claude 3 Sonnet and Meta's Llama 3 with minimal code changes to find the best tool for the job.
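As an illustration of that swap, here is a minimal sketch using the Bedrock Converse API via boto3, which normalizes request and response shapes across providers. The region, model IDs, prompt, and function name are assumptions for illustration, not production Cerebro code.

```python
# Minimal sketch: calling Bedrock's Converse API and swapping the underlying
# foundation model by changing only the model ID.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Model IDs are illustrative; the exact IDs enabled in our account may differ.
MODEL_IDS = {
    "claude": "anthropic.claude-3-sonnet-20240229-v1:0",
    "llama": "meta.llama3-70b-instruct-v1:0",
}

def explain_incident(summary: str, model: str = "claude") -> str:
    """Ask a foundation model for a human-readable explanation of an incident."""
    response = bedrock.converse(
        modelId=MODEL_IDS[model],
        messages=[{
            "role": "user",
            "content": [{"text": f"Explain this incident to an on-call engineer:\n{summary}"}],
        }],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

# Swapping models is a one-argument change:
# explain_incident(summary, model="llama")
```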
Amazon SageMaker: For Custom-Trained Models
- What it is: A comprehensive platform for building, training, and deploying our own custom machine learning models.
- When we use it:
  - When we need to train a model on our own proprietary XOPS data from the Knowledge Graph.
  - For predictive tasks that are highly specific to our platform's behavior.
  - Example: Our "predictive failure" model is trained on terabytes of historical telemetry from our own services. A general-purpose FM cannot do this.
- Why we use it: It gives us full control over the model architecture, training data, and deployment, allowing us to build highly accurate, specialized models for our unique operational challenges.
4. Verbose Workflow: Training and Deploying a New Predictive Model
This workflow details how we take an idea for a new predictive capability and turn it into a production SageMaker endpoint that Cerebro can use.
- Hypothesis & Data Gathering:
  - Idea: "We believe we can predict service mesh latency spikes by analyzing network packet data."
  - Data Sourcing: An ML engineer queries the gold layer of the Knowledge Graph to create a labeled dataset of historical network data and corresponding latency spikes (see the query sketch below).
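Purely as an illustration of this step, the sketch below assumes the gold layer is exposed as Athena-queryable tables and uses the AWS SDK for pandas (awswrangler); the table names, columns, and S3 path are hypothetical.

```python
# Illustrative only: assumes the Knowledge Graph gold layer can be queried
# through Athena. Table and column names below are hypothetical.
import awswrangler as wr  # AWS SDK for pandas

QUERY = """
SELECT t.window_start,
       t.packet_features,
       CASE WHEN l.spike_detected THEN 1 ELSE 0 END AS label
FROM gold.network_telemetry t
LEFT JOIN gold.latency_spikes l
  ON t.window_start = l.window_start
WHERE t.window_start > date_add('day', -90, current_date)
"""

# Returns a labeled pandas DataFrame, then stages it in S3 for training.
df = wr.athena.read_sql_query(sql=QUERY, database="gold")
wr.s3.to_parquet(df, path="s3://cerebro-training-data/latency-spikes/", dataset=True)
```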
- Model Development (SageMaker Studio):
  - The engineer uses a SageMaker Studio Notebook to experiment with different model architectures (e.g., LSTM, Transformer).
  - They train the model on the dataset, iterating until they achieve the desired accuracy.
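To give a flavor of that experimentation, here is a minimal PyTorch sketch of an LSTM-based spike predictor a notebook might start from; the feature dimensions and architecture are placeholders, not the production model.

```python
# Notebook-style experimentation sketch (PyTorch). Shapes are placeholders.
import torch
import torch.nn as nn

class LatencySpikePredictor(nn.Module):
    def __init__(self, n_features: int = 32, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # probability of a spike in the next window

    def forward(self, x):                 # x: (batch, time, n_features)
        _, (h_n, _) = self.lstm(x)
        return torch.sigmoid(self.head(h_n[-1]))

model = LatencySpikePredictor()
probs = model(torch.randn(8, 120, 32))    # 8 sequences of 120 time steps each
```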
- Model Training Job (SageMaker Training):
  - Once the model is finalized, the training process is codified into a script.
  - A formal SageMaker Training Job is launched. This runs the training script on a powerful, dedicated cluster and stores the resulting model artifact in an S3 bucket, which ensures training is repeatable and auditable (see the launch sketch below).
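A launch might look roughly like the following, using the SageMaker Python SDK's PyTorch estimator; the entry point, instance type, framework versions, role ARN, and S3 paths are illustrative assumptions.

```python
# Sketch of launching a formal SageMaker Training Job with the SageMaker Python SDK.
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/cerebro-training-role"  # hypothetical role ARN

estimator = PyTorch(
    entry_point="train.py",            # the codified training script
    source_dir="src/",
    role=role,
    instance_count=1,
    instance_type="ml.g5.2xlarge",
    framework_version="2.1",
    py_version="py310",
    output_path="s3://cerebro-model-artifacts/latency-spikes/",
    sagemaker_session=session,
    hyperparameters={"epochs": 20, "hidden": 64},
)

# Runs on managed infrastructure and writes model.tar.gz to output_path,
# keeping training repeatable and auditable.
estimator.fit({"training": "s3://cerebro-training-data/latency-spikes/"})
```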
- Model Deployment (SageMaker Endpoint):
  - The trained model artifact is deployed to a SageMaker Endpoint. This exposes the model as a secure, scalable, and low-latency REST API.
  - We configure auto-scaling for the endpoint to handle variations in load from Cerebro (sketched below).
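Continuing the training sketch above, deployment plus a target-tracking auto-scaling policy could look like this; the endpoint name, capacity limits, and target value are placeholders.

```python
# Sketch: deploy the trained model to a real-time endpoint and attach a
# target-tracking auto-scaling policy.
import boto3

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="cerebro-latency-spike-prod",
)

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/cerebro-latency-spike-prod/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)
autoscaling.put_scaling_policy(
    PolicyName="cerebro-latency-spike-invocations",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 200.0,  # average invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```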
- Integration with Cerebro:
  - A new "provider" is added to the Cerebro codebase that knows how to call this new SageMaker endpoint.
  - Cerebro can now use this new predictive capability as part of its analysis, for example, to trigger a proactive alert before a latency spike actually occurs.
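A hypothetical provider wrapping the new endpoint might look like the sketch below; the class interface and payload shape are assumptions, since the actual provider contract is defined in the Cerebro codebase.

```python
# Hedged sketch of a Cerebro "provider" calling the SageMaker endpoint.
import json
import boto3

class LatencySpikeProvider:
    def __init__(self, endpoint_name: str = "cerebro-latency-spike-prod"):
        self.runtime = boto3.client("sagemaker-runtime")
        self.endpoint_name = endpoint_name

    def predict(self, packet_features: list[list[float]]) -> float:
        """Return the predicted probability of a latency spike in the next window."""
        response = self.runtime.invoke_endpoint(
            EndpointName=self.endpoint_name,
            ContentType="application/json",
            Body=json.dumps({"instances": [packet_features]}),  # payload shape is assumed
        )
        return json.loads(response["Body"].read())["predictions"][0]
```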
- Monitoring and Retraining:
  - We monitor the model's accuracy and data drift using SageMaker Model Monitor.
  - A scheduled GitHub Actions workflow automatically re-triggers the SageMaker Training Job every quarter with fresh data from the Knowledge Graph to ensure the model stays up to date.
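For the monitoring half of this step, a data-quality schedule with SageMaker Model Monitor might be set up roughly as follows; the baseline paths, cadence, and instance sizing are placeholders, and the quarterly GitHub Actions retraining trigger is not shown.

```python
# Sketch of a drift-monitoring schedule with SageMaker Model Monitor.
from sagemaker.model_monitor import CronExpressionGenerator, DefaultModelMonitor

role = "arn:aws:iam::123456789012:role/cerebro-training-role"  # same hypothetical role as above

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

monitor.create_monitoring_schedule(
    monitor_schedule_name="cerebro-latency-spike-drift",
    endpoint_input="cerebro-latency-spike-prod",
    output_s3_uri="s3://cerebro-model-monitoring/latency-spikes/",
    statistics="s3://cerebro-model-monitoring/baseline/statistics.json",
    constraints="s3://cerebro-model-monitoring/baseline/constraints.json",
    schedule_cron_expression=CronExpressionGenerator.daily(),
)
```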
5. Security and Cost Management Best Practices
- IAM Roles: All access to Bedrock and SageMaker is controlled through fine-grained IAM roles. The Cerebro service has a specific role that only allows it to invoke the production model endpoints. Engineers have separate roles for development and training.
- VPCs: SageMaker endpoints are deployed within our private VPC and are not exposed to the public internet. Access is controlled via VPC endpoints.
- Cost Tagging: All AWS resources (SageMaker endpoints, training jobs, S3 buckets) are tagged with service: cerebro and environment: [prod|staging|dev]. This allows us to accurately track the cost of our AI platform (see the tagging sketch below).
- Model Pruning: We have a quarterly review process to decommission and delete old or underperforming model endpoints to control costs.
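As a sketch of how those tags could be applied to an existing SageMaker resource with boto3 (the resource ARN is a placeholder):

```python
# Illustrative: applying the standard cost-allocation tags to a SageMaker endpoint.
import boto3

sm = boto3.client("sagemaker")
sm.add_tags(
    ResourceArn="arn:aws:sagemaker:us-east-1:123456789012:endpoint/cerebro-latency-spike-prod",
    Tags=[
        {"Key": "service", "Value": "cerebro"},
        {"Key": "environment", "Value": "prod"},
    ],
)
```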