FinOps: Financial Operations for the Cloud
1. Introduction: Every Engineer is a Cost Owner
In the cloud, every line of code can have a direct financial impact. Financial Operations (FinOps) is the cultural practice of bringing financial accountability to the variable spend model of the cloud. It is a partnership between engineering, finance, and product teams to manage our cloud costs effectively, ensuring we spend every dollar wisely.
This is not about cutting costs indiscriminately. It's about making data-driven decisions to maximize the business value of our cloud investment. Every engineer and team at XOPS is expected to understand and take ownership of their service's cost and efficiency.
Core Mission: To instill a culture of cost-consciousness and empower teams with the visibility, tools, and best practices to optimize cloud spend for efficiency and business value.
2. Our FinOps Principles
- Teams are Accountable: The teams that build and run services are responsible for managing their cloud costs.
- Visibility is Key: We cannot manage what we cannot see. We provide accessible, real-time visibility into cloud spend for all teams.
- Centralized Governance, Decentralized Execution: A central FinOps team provides governance, best practices, and negotiated discounts. The individual engineering teams execute the optimizations.
- Data-Driven Decisions: All cost optimization decisions are based on data from our observability and cost management tools.
- Business Value over Raw Cost: The goal is not always the cheapest solution, but the most cost-effective one. A more expensive service can be a good investment when it delivers significant business value or a performance improvement.
3. Cloud Budgeting and Accountability
- Annual Budgeting: The FinOps team works with product and engineering leadership to set an overall annual cloud budget, aligned with business forecasts.
- Team-Level Showback: While we don't have strict team-level budgets (to avoid stifling innovation), we do have a "showback" model. Every month, each team receives a report detailing the costs incurred by their services.
- Cost Tagging is Mandatory: Our cost accountability model relies entirely on a consistent and mandatory tagging policy for all cloud resources. Any resource deployed without the correct tags is automatically flagged and quarantined.
  - Required Tags:
    - owner-team: The engineering team responsible (e.g., knowledge-graph-team).
    - service-name: The name of the microservice or component.
    - environment: prod, staging, dev, or test.
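As a rough illustration of the tagging policy, the check below validates a resource's tags against the three required keys. It is a minimal sketch, not the actual quarantine automation: the function name `find_tag_violations` and the shape of the `tags` dictionary are assumptions made for this example.

```python
# Hypothetical tag validator. The tag keys and environment values come from
# the policy above; the function name and input shape are assumptions.
REQUIRED_TAGS = ("owner-team", "service-name", "environment")
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "test"}


def find_tag_violations(tags: dict) -> list:
    """Return a list of problems with a resource's tags (empty if compliant)."""
    problems = []
    # Every required tag must be present and non-blank.
    for key in REQUIRED_TAGS:
        if not tags.get(key, "").strip():
            problems.append(f"missing required tag: {key}")
    # The environment tag, when present, must be one of the allowed values.
    env = tags.get("environment", "").strip()
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid environment: {env}")
    return problems
```

A resource with an empty result would pass; anything else would be a candidate for flagging.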
4. Cost Monitoring and Anomaly Detection
We use a combination of tools to monitor our spend and catch surprises before they become problems.
- AWS Cost Explorer: Used by the FinOps team for high-level, strategic analysis of our overall AWS bill, trend analysis, and Reserved Instance/Savings Plan coverage.
- New Relic Billing Integration: This is our primary tool for real-time cost monitoring. We ingest detailed billing data into New Relic, allowing us to build dashboards and alerts just like any other metric.
- Automated Anomaly Detection: We have a New Relic alert that runs every 6 hours. It uses the prediction() NRQL function to forecast the expected cost for each service. If the actual cost deviates from the forecast by more than 20%, it automatically creates a Cost ticket in Jira and assigns it to the owning team for investigation.
Example Anomaly Alert NRQL:
SELECT latest(aws.cost) FROM AwsBillingSample WHERE serviceName = 'MyService' FACET ownerTeam SINCE 1 day ago COMPARE WITH 1 week ago
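The 20% deviation rule can be expressed as a small calculation. The sketch below assumes the deviation is measured relative to the forecast and counts in both directions (overspend and underspend); the function names are hypothetical and this is not the alert's actual implementation.

```python
def deviation_ratio(actual_cost: float, forecast_cost: float) -> float:
    """Relative deviation of actual from forecast (0.25 means 25% off)."""
    if forecast_cost <= 0:
        raise ValueError("forecast_cost must be positive")
    return abs(actual_cost - forecast_cost) / forecast_cost


def is_cost_anomaly(actual_cost: float, forecast_cost: float,
                    threshold: float = 0.20) -> bool:
    """True when the actual cost deviates from the forecast by more than the threshold."""
    return deviation_ratio(actual_cost, forecast_cost) > threshold
```

For example, a service forecast at 100 that actually costs 130 deviates by 30% and would be flagged, while one that costs 110 would not.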
5. The Cost Optimization Cookbook
This is a living list of approved strategies for optimizing service costs.
Strategy 1: Right-Sizing
- What it is: Matching instance size and type to workload performance and capacity needs.
- How to do it:
  - Use the New Relic infrastructure agent to analyze the p95 CPU and memory utilization for your service's pods/instances over a 14-day period.
  - If utilization is consistently below 30%, the instance is a candidate for downsizing.
  - If CPU is the bottleneck, consider switching to a Compute-Optimized instance type (e.g., AWS c series).
  - Always test the performance impact of a change in staging before deploying to production.
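The right-sizing check above can be sketched as follows. This example assumes utilization samples (in percent) have already been exported for the 14-day window, uses a nearest-rank percentile, and interprets "consistently below 30%" as both the CPU and memory p95 falling under the threshold. The function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def is_downsizing_candidate(cpu_samples, memory_samples, threshold_pct=30.0):
    """True when both p95 CPU and p95 memory utilization are below the threshold."""
    return (percentile(cpu_samples, 95) < threshold_pct
            and percentile(memory_samples, 95) < threshold_pct)
```

A service whose CPU stays low but whose memory peaks at 45% would not qualify, since downsizing could starve it of memory.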
Strategy 2: Adopt ARM-based Graviton Processors
- What it is: Migrating workloads to AWS's custom ARM-based Graviton processors, which offer significantly better price-performance for many workloads.
- How to do it:
  - Build and push a multi-architecture Docker image for your service.
  - Update your Kubernetes deployment configuration to use a Graviton-based node pool (e.g., m7g instances).
  - This is a low-risk change for most of our standard Python/Go applications.
Strategy 3: Implement Storage Lifecycle Policies
- What it is: Automatically transitioning data to cheaper storage tiers as it ages.
- How to do it:
  - For S3 buckets used for log archiving, create a lifecycle policy that moves data from S3 Standard to S3 Infrequent Access after 30 days, and then to S3 Glacier Deep Archive after 90 days.
  - This policy is managed via Terraform.
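To make the tiering schedule concrete, the sketch below models which storage class an object would sit in at a given age, and builds a rule in the general shape of the S3 lifecycle configuration API. It is an illustration only and not the Terraform definition; the helper names and the `rule_id` and `prefix` parameters are assumptions.

```python
# Transition schedule from the policy above: (minimum age in days, storage class).
# Storage class identifiers use the S3 API names.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "DEEP_ARCHIVE")]


def storage_class_for_age(age_days: int) -> str:
    """Storage class an object is expected to be in at the given age."""
    current = "STANDARD"
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current


def lifecycle_rule(rule_id: str, prefix: str = "") -> dict:
    """Build one lifecycle rule as a plain dictionary (hypothetical helper)."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in TRANSITIONS
        ],
    }
```

Keeping the schedule in one list means the age thresholds are stated once and reused by both helpers.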
Strategy 4: Leverage Reserved Instances (RIs) and Savings Plans
- What it is: Committing to a certain level of usage over a 1- or 3-year term in exchange for a significant discount from AWS.
- Who does it: This is managed centrally by the FinOps team. They analyze our platform's stable, predictable usage (our "baseload") and purchase RIs and Savings Plans to cover it, maximizing our discounts.
6. Calculating Cost per Tenant
Understanding the cost to serve a specific customer is a critical business metric. While we don't have a perfect, real-time system for this, we use a quarterly estimation model.
- Identify Shared Costs: The total cost of all shared infrastructure (e.g., Kubernetes control planes, monitoring tools, networking) is calculated.
- Identify Tenant-Specific Costs: The cost of any resources that are dedicated to a single tenant (e.g., a dedicated database cluster) is calculated.
- Allocate Shared Costs: Shared costs are allocated to tenants based on a primary usage metric, typically the volume of API requests or the amount of data stored in the Knowledge Graph.
- Calculate Total Cost: The total cost for a tenant is their direct, tenant-specific cost plus their allocated portion of the shared costs.
This model is run by the FinOps team in partnership with Data Science and provided to the Customer Success and Product teams.
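The allocation steps above amount to a proportional split of shared costs. The sketch below is a minimal illustration under stated assumptions: a single usage metric per tenant (for example, API request volume), and hypothetical names for the function and its inputs. It is not the FinOps team's actual model.

```python
def cost_per_tenant(shared_cost: float, direct_costs: dict, usage: dict) -> dict:
    """Total cost per tenant: direct cost plus a usage-weighted share of shared cost.

    shared_cost  -- total cost of shared infrastructure for the period
    direct_costs -- tenant -> cost of resources dedicated to that tenant
    usage        -- tenant -> value of the primary usage metric
    """
    total_usage = sum(usage.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive to allocate shared costs")
    totals = {}
    # Include tenants that appear in either input.
    for tenant in set(usage) | set(direct_costs):
        share = shared_cost * usage.get(tenant, 0) / total_usage
        totals[tenant] = direct_costs.get(tenant, 0.0) + share
    return totals
```

With a shared cost of 1000, a tenant holding three quarters of the usage and 200 in dedicated resources would be attributed 950, and the per-tenant totals always sum to the shared cost plus all direct costs.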