Skip to main content

GitHub Actions Guide: CI/CD and Automation

1. Introduction: The Engine of Our Development Lifecycle

GitHub Actions is the automation engine at the heart of our software development lifecycle. It builds, tests, and deploys every component of the XOPS platform. It is also a key tool for automating operational tasks, integrating seamlessly with Sparky and our broader infrastructure.

This guide defines our best practices for creating CI/CD pipelines, security scanning, and other automated workflows. Adherence to these standards is mandatory for all services to ensure consistency, security, and reliability.

Core Mission: To provide a fast, secure, and reliable path from code commit to production deployment, while empowering developers to automate repetitive tasks.


2. Usage in Handbook Sections

GitHub Actions is woven into the fabric of our engineering processes:

  • SRE and Monitoring/Availability: A failed deployment pipeline is an availability issue. We monitor the health of our GitHub Actions workflows as a critical service.
  • Proactive Problem Resolution: Sparky's L2 fixes are delivered as GitHub Pull Requests, which are then built, tested, and deployed by GitHub Actions.
  • Resilience and Testing: Chaos experiments are automatically triggered via scheduled GitHub Actions. Our CI pipelines also run comprehensive test suites (unit, integration, e2e) on every commit.
  • Platform Operations and Maintenance: Scheduled workflows handle tasks like database backups, security scans, and dependency updates.

3. The Standard CI/CD Pipeline

All services in the XOPS platform must have a GitHub Actions workflow that follows this standard, multi-stage process. We use reusable workflows to ensure consistency.

Key Stages Explained:

  1. Run Tests & Scans: On every push to a feature branch, this job runs. It includes:

    • Unit Tests: pytest or jest.
    • Static Analysis: ruff, eslint.
    • Security Scanning: trufflehog for secrets, semgrep for code patterns.
    • Dependency Scanning: FOSSA scan to check for vulnerable or non-compliant licenses. A PR cannot be merged if any of these checks fail.
  2. Build & Push Container: Once a PR is merged to main, a new workflow triggers. It builds a Docker container, tags it with the commit SHA, and pushes it to our AWS ECR registry.

  3. Deploy to Staging: The new container is automatically deployed to our staging Kubernetes environment.

  4. Run E2E & Chaos Tests: After deployment to staging, a suite of end-to-end tests are run against the live staging environment. We also trigger a small-scale, pre-defined chaos experiment to ensure the new release has not introduced any resilience regressions.

  5. Manual Approval for Production: This is a critical control gate. A senior engineer or team lead must manually approve the production deployment after reviewing the test results from staging. This is done directly in the GitHub Actions UI.

  6. Deploy to Production: Upon approval, the container is deployed to the production Kubernetes environment using a canary deployment strategy.


4. Verbose Workflow: Sparky's L2 Fix in Action

This workflow shows how GitHub Actions enables Sparky's AI-generated fixes.

  1. Sparky Creates PR: As detailed in the Sparky guide, Sparky detects an issue and creates a PR with an AI-generated fix.
  2. CI Checks Trigger: GitHub Actions automatically runs the full suite of tests and scans on Sparky's proposed code. This is our safety net, ensuring the AI's code meets our quality and security standards.
  3. Human Review: The on-call engineer is notified. They review the code, which is already accompanied by a green checkmark from the CI pipeline. This dramatically increases their confidence in the fix.
  4. Merge and Deploy: The engineer approves and merges the PR. The standard CI/CD pipeline takes over, deploying the fix to production within minutes.

5. Template: Standard CI Workflow for a Python Service

This YAML template should be the starting point (.github/workflows/ci.yml) for any new Python-based microservice.

name: Standard CI/CD

on:
push:
branches:
- main
pull_request:

jobs:
test-and-scan:
name: Run Tests and Scans
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3

- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'

- name: Install dependencies
run: pip install -r requirements.txt

- name: Run unit tests
run: pytest

- name: Run linter (Ruff)
run: ruff check .

- name: Security Scan (Semgrep)
uses: returntocorp/semgrep-action@v1
with:
publishToken: ${{ secrets.SEMGREP_APP_TOKEN }}

deploy:
name: Build and Deploy
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
needs: test-and-scan
# ... further steps for build, push to ECR, and deploy ...
# This section would call reusable workflows for deployment
steps:
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE_ARN }}
aws-region: ${{ secrets.AWS_REGION }}

- name: Build and Push to ECR
uses: ./.github/actions/ecr-build-push
with:
image-name: ${{ github.event.repository.name }}
tag: ${{ github.sha }}

- name: Deploy to Staging
uses: ./.github/actions/k8s-deploy
with:
environment: 'staging'
image-tag: ${{ github.sha }}

This template uses other reusable actions (ecr-build-push, k8s-deploy) which are defined centrally in the infra-observability repository to ensure deployment logic is consistent and secure.


6. AI-Assisted PR Reviews with Claude

To improve code quality, accelerate the review process, and catch issues before they reach human reviewers, we employ an AI-assisted review step on every Pull Request. This workflow uses the Claude AI model to perform both a general code review and a targeted security scan.

The AI Review Workflow

This action is non-blocking but provides invaluable feedback to both the developer and the eventual human reviewer.

  1. PR Opened: A developer opens a new Pull Request.
  2. Action Triggered: A dedicated GitHub Actions workflow, claude-review.yml, is triggered by the pull_request event.
  3. Get PR Diff: The action checks out the code and gets the diff of the changes from the base branch.
  4. Parallel Reviews:
    • (a) Code Review: The code diff is sent to the Claude API with a carefully crafted prompt, asking it to act as a senior engineer and review the code for quality, bugs, and style.
    • (b) Security Scan: The action first runs a static security analysis using the bandit tool on the changed Python files.
  5. Claude Security Analysis: The output from bandit, along with the code diff, is sent to the Claude API with a different prompt, asking it to review the code specifically for security vulnerabilities that bandit might have missed.
  6. Post Comments: The action uses the GitHub API to post the formatted responses from both Claude reviews as comments on the Pull Request.

Implementation Details

Claude Code Review Prompt

A good prompt is key to getting a good review.

System Prompt: You are a senior staff engineer at a world-class technology company. You are performing a code review on a pull request. The user will provide you with the code diff. Please review it for the following:

  • Bugs or logical errors: Are there any potential runtime errors or logical flaws?
  • Best practices and style: Does the code adhere to modern best practices (e.g., is it clean, readable, and maintainable)?
  • Performance: Are there any obvious performance bottlenecks?
  • Suggestions for improvement: Are there any alternative implementations that might be cleaner, more efficient, or more idiomatic?

Please provide your feedback in a concise, clear, and constructive manner. Format your response in Markdown. If you have no issues to raise, simply respond with "LGTM!"

Claude Security Review Prompt

This prompt is more focused.

System Prompt: You are a principal security engineer specializing in application security. You are reviewing a pull request for potential vulnerabilities. You will be given the code diff and the output from a bandit static analysis scan. Your task is to:

  1. Analyze the bandit output.
  2. Perform a deeper review of the code diff, looking for more subtle security issues that static analysis might miss (e.g., race conditions, insecure business logic, missing authentication/authorization checks, indirect command injection).
  3. Provide a clear and actionable security report in Markdown. For each finding, state the vulnerability, the potential impact, and a recommended mitigation. If you find no issues, simply respond with "No security issues found."

Template: claude-review.yml

This workflow runs the review process.

name: AI PR Review

on:
pull_request:
types: [opened, synchronize]

jobs:
claude-review:
name: Run Claude AI Reviews
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
with:
fetch-depth: 0 # Fetches all history for diffing

- name: Get PR Diff
id: pr-diff
run: |
# This script gets the diff and sets it as a GitHub Action output
echo "diff=$(git diff origin/${{ github.base_ref }} ${{ github.sha }})" >> $GITHUB_OUTPUT

- name: Run Claude Code Review
# This step would call our internal script that uses the AnthropicVertex client
# It passes the diff from the previous step and the code review prompt
run: |
python ./scripts/run_claude_review.py \
--diff "${{ steps.pr-diff.outputs.diff }}" \
--prompt-type "code-review" \
--pr-number ${{ github.event.number }} \
--repo-name ${{ github.repository }}
env:
CLAUDE_API_KEY: ${{ secrets.CLAUDE_API_KEY }}

- name: Run Claude Security Review
# This step calls a different script that first runs bandit
# and then sends the combined output to Claude for analysis
run: |
python ./scripts/run_claude_security_review.py \
--diff "${{ steps.pr-diff.outputs.diff }}" \
--prompt-type "security-review" \
--pr-number ${{ github.event.number }} \
--repo-name ${{ github.repository }}
env:
CLAUDE_API_KEY: ${{ secrets.CLAUDE_API_KEY }}

This automated review process acts as a powerful "pre-flight check" for our human review process, ensuring that by the time an engineer looks at a PR, the simple issues have already been found and flagged.