Platform Operations and Maintenance: Keeping the Engine Running
1. Introduction: The Foundation of Reliability
Platform Operations and Maintenance is the set of routine yet critical activities that ensure the XOPS platform remains stable, secure, and efficient. While incident response and performance tuning are about reacting to specific events, this discipline is about the proactive, scheduled work that prevents those events from happening in the first place.
This guide details our standardized processes for everything from routine health checks and security audits to access control and ecosystem maintenance.
Core Mission: To maintain a state of continuous health, security, and compliance across the entire platform through automated, repeatable, and auditable operational procedures.
2. Periodic Checks: Our Rhythm of Proactive Care
We perform a series of automated and manual checks at regular intervals. These tasks are tracked as scheduled issues in Jira Service Management.
Weekly Checks (Automated via Sparky)
- Trigger: Scheduled GitHub Action every Monday at 09:00 UTC.
- Action: Sparky executes the `weekly-health-scan` playbook.
- Checklist:
- [✅] Full System Health Scan: Trigger deep health checks for all P1 services.
- [✅] Stale Pods/Resources: Scan all Kubernetes namespaces for pods or other resources that have been in a non-running state for > 24 hours.
- [✅] Backup Verification: Verify that the latest database backups were completed successfully and run a test restore on a temporary, isolated instance.
- [✅] SSL Certificate Expiry: Scan all public-facing endpoints for SSL certificates expiring in the next 30 days (a sketch of this check follows this subsection).
- Output: Sparky posts a summary report to the `#sre-weekly-report` Slack channel. If any check fails, a P3 incident is automatically created in JSM.
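To make the SSL expiry check concrete, here is a minimal sketch of the kind of probe the `weekly-health-scan` playbook could run, using only the Python standard library. The endpoint list, function name, and the way results are reported are illustrative assumptions, not the playbook's actual implementation; only the 30-day threshold comes from the checklist above.

```python
# Illustrative SSL expiry probe (not the actual weekly-health-scan playbook).
import socket
import ssl
from datetime import datetime, timezone

EXPIRY_THRESHOLD_DAYS = 30                          # matches the checklist above
ENDPOINTS = ["api.example.com", "app.example.com"]  # hypothetical endpoint list

def days_until_expiry(hostname: str, port: int = 443) -> int:
    """Return the number of days until the endpoint's certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    not_after = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (not_after - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host in ENDPOINTS:
        days = days_until_expiry(host)
        if days <= EXPIRY_THRESHOLD_DAYS:
            print(f"WARN: {host} certificate expires in {days} days")
```

A failed check of this kind is what feeds the automatic P3 incident described in the Output step above.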
Monthly Checks
- Trigger: Scheduled JSM ticket on the first business day of the month.
- Owner: On-call SRE.
- Checklist:
- [✅] Performance Audit: Review the p95 latency and resource utilization trends for all P1 services over the last 30 days. Escalate any negative trends to the Performance Tuning process.
- [✅] Cost Anomaly Review: Review the AWS cost report for the previous month and investigate any significant, unexpected cost increases (see the sketch after this checklist).
- [✅] Capacity Planning Review: Check current resource utilization against our defined capacity limits. Is any service approaching 80% of its CPU, memory, or storage allocation?
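For the cost anomaly review above, a comparison along the following lines (using boto3's Cost Explorer client) can pull month-over-month spend per AWS service and flag large jumps. The 20% threshold and the way results are surfaced are assumptions for this sketch, not the actual review script.

```python
# Illustrative month-over-month cost comparison using AWS Cost Explorer.
# The 20% threshold and date handling are assumptions for this sketch.
import boto3

def monthly_cost_by_service(start: str, end: str) -> dict[str, float]:
    """Return spend per AWS service for one month (ISO dates, end exclusive)."""
    ce = boto3.client("ce")
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start, "End": end},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return {
        group["Keys"][0]: float(group["Metrics"]["UnblendedCost"]["Amount"])
        for group in resp["ResultsByTime"][0]["Groups"]
    }

def flag_anomalies(previous: dict, current: dict, threshold: float = 0.20):
    """Yield services whose spend grew by more than `threshold` month over month."""
    for service, cost in current.items():
        baseline = previous.get(service, 0.0)
        if baseline and (cost - baseline) / baseline > threshold:
            yield service, baseline, cost
```

Anything flagged still needs a human to decide whether the increase was expected (e.g., a planned scale-up) before an investigation ticket is raised.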
Quarterly Checks
- Trigger: Scheduled JSM ticket at the start of each quarter.
- Owner: SRE Team Lead.
- Checklist:
- [✅] Security Review & Penetration Testing: (See workflow below).
- [✅] Access Control Audit: (See workflow below).
- [✅] FOSSA Report Review: Review the latest open source dependency and license report from FOSSA. Create tickets for any medium-risk issues that need to be addressed.
- [✅] Post-Mortem Review: Review all P1/P2 incident post-mortems from the previous quarter. Are there any recurring themes or systemic issues that need to be addressed?
Annual Checks
- Trigger: Scheduled JSM ticket every November.
- Owner: Head of SRE / CTO.
- Checklist:
- [✅] Disaster Recovery (DR) Test: Perform a full, announced DR failover between our primary and secondary AWS regions.
- [✅] Compliance Audit: Engage with our external auditors to provide evidence for SOC 2, ISO 27001, etc. This includes reports from FOSSA, JSM, and our security scanning tools.
3. Verbose Workflow: Quarterly Security Penetration Testing
- Planning: A "Penetration Test" epic is created in Jira. The SRE team, in collaboration with a third-party security vendor, defines the scope of the test (e.g., "This quarter, we will focus on the Control Center and the Platform API Gateway").
- Scheduling: A maintenance window is scheduled in PagerDuty to ensure that alerts generated by the pen test do not trigger a real incident response.
- Execution: The security vendor (or an internal red team) executes a series of automated and manual attacks against the staging and production environments. Tools may include OWASP ZAP, Burp Suite, and custom scripts.
- Detection & Logging: All attack attempts are logged by our security infrastructure (Cloudflare WAF, Sentry, etc.). The SRE team monitors the system to ensure there is no real-world customer impact.
- Reporting: The security vendor provides a detailed report of their findings, including the severity and reproducibility of each vulnerability.
- Remediation: For each finding, a `Vulnerability` issue is created in Jira and assigned to the appropriate engineering team. These issues are high-priority and have strict resolution SLAs.
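Filing the remediation tickets can itself be scripted. The sketch below uses Jira Cloud's REST API (the v3 issue-creation endpoint) with an API token; the project key, priority mapping, and base URL are assumptions about our Jira configuration rather than confirmed values.

```python
# Illustrative helper that files a remediation ticket for a pen-test finding
# via Jira Cloud's REST API. Project key, priority mapping, and base URL are
# assumptions about the Jira configuration.
import os
import requests

JIRA_BASE_URL = "https://example.atlassian.net"  # hypothetical site URL
AUTH = (os.environ["JIRA_USER_EMAIL"], os.environ["JIRA_API_TOKEN"])

def create_vulnerability_issue(summary: str, description: str, severity: str) -> str:
    payload = {
        "fields": {
            "project": {"key": "SEC"},               # assumed project key
            "issuetype": {"name": "Vulnerability"},  # issue type from the workflow above
            "summary": summary,
            "description": {                         # Atlassian Document Format body
                "type": "doc",
                "version": 1,
                "content": [{"type": "paragraph",
                             "content": [{"type": "text", "text": description}]}],
            },
            "priority": {"name": "High" if severity in ("critical", "high") else "Medium"},
        }
    }
    resp = requests.post(
        f"{JIRA_BASE_URL}/rest/api/3/issue", json=payload, auth=AUTH, timeout=30
    )
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "SEC-123"
```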
4. Verbose Workflow: Quarterly Access Control Audit
Maintaining the principle of least privilege is paramount. This quarterly audit ensures that only the right people have the right access to the right resources.
- Data Source: Our single source of truth for infrastructure access is StrongDM.
- Automation Trigger: A scheduled GitHub Action runs a script that uses the StrongDM API to export a list of all users and the roles/resources they have access to.
- Analysis: The script compares this export to the list of active employees from our HR system (e.g., Workday); a sketch of this comparison appears at the end of this workflow.
- Finding: Any user who has access in StrongDM but is no longer an active employee is flagged for immediate de-provisioning.
- Finding: Any user who has had "temporary" elevated access for more than 30 days is flagged for review.
- Certification: For all remaining users, an email is sent to their manager asking them to certify that the user still requires their current level of access.
- Remediation:
- Access for terminated employees is automatically revoked by Sparky via the StrongDM API.
- Managers who do not certify their team's access within 7 days have their team's access temporarily suspended.
- Reporting: A final report is generated and attached to the quarterly JSM audit ticket.
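The analysis step in this workflow reduces to a set comparison. The sketch below assumes the StrongDM export and the HR roster have already been written to local files; the file formats, field names, and helper names are hypothetical, and it is meant to show the logic rather than the production audit script.

```python
# Illustrative analysis step for the access audit. File names, field names,
# and formats are assumptions; the 30-day temporary-access window mirrors
# the workflow above.
import csv
import json
from datetime import datetime, timedelta, timezone

TEMP_ACCESS_LIMIT = timedelta(days=30)

def load_active_employees(path: str) -> set[str]:
    """Read the HR roster (e.g., a Workday CSV export) into a set of emails."""
    with open(path, newline="") as f:
        return {row["email"].lower() for row in csv.DictReader(f)}

def audit(strongdm_export_path: str, hr_roster_path: str):
    active = load_active_employees(hr_roster_path)
    with open(strongdm_export_path) as f:
        grants = json.load(f)  # StrongDM user/role export, one record per user

    now = datetime.now(timezone.utc)
    terminated, stale_temporary = [], []
    for grant in grants:
        email = grant["email"].lower()
        if email not in active:
            terminated.append(email)            # flag for immediate de-provisioning
        elif grant.get("temporary"):
            # granted_at is assumed to be ISO-8601 with a timezone offset
            granted_at = datetime.fromisoformat(grant["granted_at"])
            if now - granted_at > TEMP_ACCESS_LIMIT:
                stale_temporary.append(email)   # flag for review
    return terminated, stale_temporary
```

The two returned lists map directly onto the remediation steps above: terminated users go to Sparky for automated revocation, and stale temporary grants go to managers for certification.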
5. Access Control and Key Rotation
- Access Management: All access to production infrastructure (databases, Kubernetes clusters, SSH) is managed exclusively through StrongDM. Direct access is forbidden.
- Automated Key Rotation: We practice aggressive, automated rotation of all secrets and credentials.
- Frequency: All database passwords, API keys, and other secrets are rotated every 90 days.
- Mechanism: We use HashiCorp Vault as our central secrets store. A scheduled GitHub Action triggers a rotation job (sketched at the end of this section) that:
- Generates a new password/key.
- Updates the secret in both Vault and the target system (e.g., the database).
- Triggers a rolling restart of any application pods that need to pick up the new secret.
- Sparky's Role: If a service fails to restart after a key rotation, Sparky immediately triggers an incident and attempts to perform an automated rollback to the previous secret version to restore service while the issue is investigated.
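For illustration, here is a minimal sketch of the three mechanism steps above, assuming a PostgreSQL target, the hvac Vault client, and kubectl for the rolling restart. The Vault path, database role, and deployment name are hypothetical, and this is not the actual rotation job.

```python
# Minimal sketch of the 90-day rotation steps: new secret -> Vault -> target
# system -> rolling restart. Assumes PostgreSQL, the hvac client, and kubectl;
# the Vault path, database user, and deployment name are hypothetical.
import secrets
import subprocess

import hvac
import psycopg2

def rotate_db_password(vault_addr: str, vault_token: str) -> None:
    new_password = secrets.token_urlsafe(32)

    # 1. Write the new secret to Vault (KV v2 engine mounted at "secret/").
    vault = hvac.Client(url=vault_addr, token=vault_token)
    vault.secrets.kv.v2.create_or_update_secret(
        mount_point="secret",
        path="platform/db/app-user",
        secret={"password": new_password},
    )

    # 2. Update the target system (here, a Postgres role).
    conn = psycopg2.connect(dbname="app", user="admin", host="db.internal")
    with conn, conn.cursor() as cur:
        cur.execute("ALTER USER app_user WITH PASSWORD %s", (new_password,))
    conn.close()

    # 3. Trigger a rolling restart so application pods pick up the new secret.
    subprocess.run(
        ["kubectl", "rollout", "restart", "deployment/app", "-n", "platform"],
        check=True,
    )
```

Vault's KV v2 engine keeps prior secret versions, which is what makes the automated rollback described in Sparky's role above possible.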