Cloudflare Guide: Our Shield and Accelerator at the Edge
1. Introduction: The First Line of Defense
Cloudflare is our platform's front door to the public internet. It sits at the edge of our network, acting as our primary shield against attacks, a global content delivery network (CDN) to accelerate our applications, and a serverless platform for running code close to our users.
Every public-facing request to the XOPS platform flows through Cloudflare first. Understanding its role and configuration is critical for SRE, security, and front-end development.
Core Mission: To leverage Cloudflare's global network to make our platform faster, more secure, and more reliable.
2. Key Features and Our Usage
We use a wide array of Cloudflare features, which are managed as code wherever possible.
Web Application Firewall (WAF)
- What it is: A firewall that inspects incoming HTTP requests and blocks malicious traffic before it ever reaches our origin servers.
- Our Usage:
- Managed Rulesets: We enable Cloudflare's Managed Rulesets for OWASP Core Rules and technology-specific rules (e.g., for Kubernetes). These block common attack vectors like SQL Injection and Cross-Site Scripting (XSS).
- Custom Rules: We create custom WAF rules to block traffic from known bad IPs or to rate-limit specific, resource-intensive API endpoints.
- Mode: All rules are deployed in block mode in production. In staging, they run in log mode to test for false positives.
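The rate-limiting idea behind our custom rules can be sketched as a fixed-window counter keyed by client IP and endpoint. The helper below is purely illustrative — the endpoint, window, and limit values are hypothetical, and our real limits live in the Terraform-managed WAF rules, not in application code:

```javascript
// Minimal fixed-window rate limiter, keyed by client IP + endpoint.
// WINDOW_MS and LIMIT are hypothetical values for illustration.
const WINDOW_MS = 60_000; // 1-minute window
const LIMIT = 100;        // max requests per window

const counters = new Map(); // key -> { windowStart, count }

function isRateLimited(ip, endpoint, now = Date.now()) {
  const key = `${ip}:${endpoint}`;
  const entry = counters.get(key);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    // First request in a fresh window: reset the counter.
    counters.set(key, { windowStart: now, count: 1 });
    return false;
  }
  entry.count += 1;
  return entry.count > LIMIT;
}
```

Cloudflare's actual rate limiting is distributed across its edge and far more sophisticated, but the core decision — count requests per source within a window, block above a threshold — is the same.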
Caching & Content Delivery (CDN)
- What it is: Cloudflare caches our static assets (images, CSS, JavaScript) on its thousands of edge servers around the world, serving them directly to users from a location near them.
- Our Usage:
- We use Page Rules to define our caching strategy.
- Cache Level: Cache Everything for static assets.
- Edge Cache TTL: Long TTLs (e.g., 7 days) for fingerprinted assets; shorter TTLs for non-fingerprinted assets.
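The TTL split above can be sketched as a simple decision on the asset path. The regex and TTL values below are assumptions for illustration, not our exact Page Rule configuration:

```javascript
// Decide an edge-cache TTL (in seconds) based on whether an asset
// filename is content-fingerprinted (e.g., app.3f9a2b7c.js).
// The regex and TTL values are illustrative only.
const FINGERPRINT_RE = /\.[0-9a-f]{8,}\.(js|css|png|jpg|svg|woff2)$/i;

function edgeCacheTtl(path) {
  return FINGERPRINT_RE.test(path)
    ? 7 * 24 * 3600 // 7 days: safe, because the URL changes when content changes
    : 300;          // 5 minutes: non-fingerprinted assets can change in place
}
```

The design point: fingerprinted assets are immutable (a new build produces a new URL), so a long TTL never serves stale content; non-fingerprinted assets need a short TTL because the same URL can be re-published.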
Cloudflare Workers (Serverless at the Edge)
- What it is: A serverless platform that allows us to run JavaScript code directly on Cloudflare's edge network.
- Our Usage:
- A/B Testing: We can deploy a Worker that intercepts requests and routes a percentage of users to a new version of our application, without changing any backend infrastructure.
- Header Manipulation: Workers can add or remove HTTP headers on requests and responses on the fly, for example to add security headers like Content-Security-Policy.
- Edge Authentication: For some internal tools, a Worker can perform an authentication check at the edge, before the request is even allowed to hit our origin.
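A minimal sketch of the header-manipulation pattern, using the standard Response/Headers APIs available in the Workers runtime (and Node 18+). The specific header values are examples, not our production policy; inside a Worker's fetch handler you would fetch the origin response and pass it through this helper:

```javascript
// Copy a response and attach security headers at the edge.
// A constructed Response's headers are mutable, so we rebuild the
// response with a fresh Headers object. Header values are examples.
function withSecurityHeaders(response) {
  const headers = new Headers(response.headers);
  headers.set('Content-Security-Policy', "default-src 'self'");
  headers.set('X-Content-Type-Options', 'nosniff');
  headers.set('Strict-Transport-Security', 'max-age=31536000');
  return new Response(response.body, {
    status: response.status,
    statusText: response.statusText,
    headers,
  });
}
```

Because this runs at the edge, the origin application never needs to know about these headers — they are applied uniformly to every response that passes through the Worker.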
3. Configuration as Code: Terraform and Wrangler
All Cloudflare configuration is managed as code. This is non-negotiable. It ensures our setup is version-controlled, auditable, and repeatable.
- Terraform: Used for managing DNS records, WAF rules, Page Rules, and other account-level settings.
- Wrangler CLI: The command-line tool for developing and deploying Cloudflare Workers.
Example: Managing a WAF Rule with Terraform
This Terraform code defines a custom WAF rule that blocks a specific bad IP address. (Note: cloudflare_filter and cloudflare_firewall_rule are the legacy resources; newer versions of the Cloudflare Terraform provider express WAF custom rules via cloudflare_ruleset.)
resource "cloudflare_filter" "block_bad_ip" {
  zone_id     = var.cloudflare_zone_id
  description = "Block traffic from a known malicious IP"
  expression  = "(ip.src eq 198.51.100.1)"
  paused      = false
}

resource "cloudflare_firewall_rule" "rule_for_bad_ip" {
  zone_id     = var.cloudflare_zone_id
  description = "Block a bad IP"
  filter_id   = cloudflare_filter.block_bad_ip.id
  action      = "block"
  priority    = 1
}
Example: Deploying a Worker with Wrangler
This command, run from a CI/CD pipeline, deploys a Worker from the project in the current directory.
# wrangler.toml must be configured with the account_id and worker name
npx wrangler deploy
4. Verbose Workflow: Responding to a WAF Security Event
This workflow details how we respond when the WAF blocks a significant, coordinated attack.
- Alerting: Cloudflare is configured to send WAF event data to New Relic. A New Relic alert fires when the number of WAF Block events from a single IP address exceeds 1,000 in 5 minutes. This triggers a P3 security incident in Jira Service Management.
- Sparky Triage: Sparky receives the alert and immediately queries the Cloudflare API and other threat intelligence sources for the offending IP address. It adds this context to the JSM ticket.
- Human Analysis: The on-call SRE reviews the JSM ticket and the WAF logs in the Cloudflare dashboard. They analyze the pattern of blocked requests to understand the nature of the attack (e.g., a DDoS attack, a credential stuffing attempt).
- Remediation:
- If it's a simple, single-source attack, the engineer can create a new Terraform-managed WAF rule to block the IP or ASN permanently.
- If it's a complex, distributed attack, they may need to create a more sophisticated custom rate-limiting rule.
- Implementation: The engineer opens a Pull Request with the updated Terraform code. It is reviewed by another SRE and, once approved and merged, deployed via our standard CI/CD pipeline.
- Verification: The engineer monitors the WAF logs in Cloudflare and the alert in New Relic to confirm that the malicious traffic is being blocked and the alert has cleared.
5. Performance Monitoring with Cloudflare
Cloudflare provides valuable performance metrics that we feed into our overall monitoring picture.
- Cache Hit Ratio: We monitor our cache hit ratio in New Relic. A sudden drop can indicate a misconfiguration and will increase load on our origin servers.
- Argo Smart Routing: We use Argo to intelligently route traffic across the fastest paths on the Cloudflare network. We monitor the "Time Saved" by Argo as a key performance indicator.
- Real User Monitoring (RUM): Cloudflare Browser Insights provides another source of RUM data, which we compare against our primary RUM tool (New Relic Browser) to get a complete picture of front-end performance.
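The cache-hit-ratio check described above reduces to a simple calculation over edge request counts. The 0.85 alert threshold below is an assumption for illustration; our real threshold is configured in New Relic:

```javascript
// Compute the cache hit ratio from edge request counts and flag a
// drop below an alerting threshold. The 0.85 default is illustrative.
function cacheHitRatio(hits, misses) {
  const total = hits + misses;
  return total === 0 ? 0 : hits / total;
}

function cacheRatioHealthy(hits, misses, threshold = 0.85) {
  return cacheHitRatio(hits, misses) >= threshold;
}
```

A sudden drop in this ratio means more requests are reaching the origin, so it doubles as an early-warning signal for origin load.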