Pipeline Automation with AI

Your monorepo CI runs every job on every push. A one-line README edit triggers a 38-minute build of all six services, the test job goes red 30% of the time on the same three flaky specs, and last Friday’s deploy to Cloudflare Workers shipped a regression that nobody caught until support tickets rolled in. You don’t need “AI-powered self-healing pipelines” — you need the build to only run what changed, the flaky tests quarantined, and a deploy that rolls itself back when error rates spike.

This article shows the concrete workflow for getting an AI agent to write those pipeline changes for you across three tools (Cursor, Claude Code, and Codex), with the real GitHub Actions YAML they produce.

  • A dorny/paths-filter change-detection job that skips unaffected services (and the prompt that generates it)
  • A flaky-test triage workflow: feed the agent a failing run, get back a quarantine list and a retry policy
  • A progressive Cloudflare Workers deploy with wrangler versions + a gradual rollout and automatic rollback
  • Copy-paste prompts for each, tuned for Cursor, Claude Code, and Codex
  • The failure modes that bite when you let an agent edit CI, and how to recover

The single highest-leverage CI change in a monorepo is to stop building everything. dorny/paths-filter reads the diff and sets per-path outputs you gate jobs on. Ask the agent to write the filter against your actual directory layout, not a template.

The agent should produce something like this — a real, runnable job, not a description of one:

.github/workflows/ci.yml
name: CI
on:
  pull_request:
  push:
    branches: [main]
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      api: ${{ steps.filter.outputs.api }}
      web: ${{ steps.filter.outputs.web }}
    steps:
      - uses: actions/checkout@v5
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            api:
              - 'services/api/**'
              - 'packages/shared/**'
            web:
              - 'services/web/**'
              - 'packages/shared/**'
  test-api:
    needs: changes
    if: ${{ needs.changes.outputs.api == 'true' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - run: npm ci && npm run test --workspace=services/api

Note packages/shared/** appears under both filters: a change to shared code correctly rebuilds both consumers. That cross-dependency mapping is exactly what you want the agent to infer from your repo rather than hand-maintaining.
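To sanity-check the agent's filter before pushing, it can help to reproduce the same decision locally. The following is a minimal sketch, not the action's implementation: `affected_services` is a hypothetical helper that hard-codes the two filters from the workflow above and reads changed paths from stdin.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the api/web filters from ci.yml above.
# affected_services is a hypothetical helper, not part of dorny/paths-filter.
# Reads changed file paths (one per line) on stdin and prints api=/web= flags.
affected_services() {
  local api=false web=false path
  # The "|| [ -n ... ]" keeps a final line that has no trailing newline.
  while IFS= read -r path || [ -n "$path" ]; do
    case "$path" in
      services/api/*)    api=true ;;
      services/web/*)    web=true ;;
      packages/shared/*) api=true; web=true ;;  # shared code rebuilds both consumers
    esac
  done
  echo "api=$api"
  echo "web=$web"
}
```

A typical invocation would be `git diff --name-only origin/main...HEAD | affected_services`. If the flags disagree with what the `changes` job reports on the same diff, the generated filter and your mental model of the layout have drifted apart.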

The interaction differs by tool. Cursor edits the YAML inline in the editor; Claude Code runs headless in the terminal and can verify with gh; Codex runs the same prompt as a non-interactive exec job or as a Cloud task off a GitHub issue.

In Agent mode, open the repo and point it at the real layout. Cursor reads package.json workspaces and the services/ tree, then writes .github/workflows/ci.yml as an inline diff you review before accepting:

Add a changes job using dorny/paths-filter@v3 to .github/workflows/ci.yml. Derive one filter per package in my workspaces array, and make every filter also match packages/shared/**. Gate each existing test job behind its matching needs.changes.outputs.* value. Use actions/checkout@v5.

Use a checkpoint before accepting so you can revert the whole edit in one click if the gating is wrong.

Flaky tests don’t need an AI “stability analyzer” — they need someone to look at the last N runs, find the specs that fail nondeterministically, quarantine them, and open a ticket. That “someone” can be the agent, because it can read a failing run’s logs directly through the GitHub MCP server.

The GitHub MCP server is a remote Streamable HTTP endpoint. Install it once; the endpoint URL is the same across all three tools since MCP is a shared standard, though each tool registers it in its own way. In Claude Code:

claude mcp add --transport http github https://api.githubcopilot.com/mcp/

With it connected, the agent can pull the failed job’s logs instead of you copy-pasting them. The before/after is the whole point: without the MCP server you paste a log dump; with it, you say “the last run of CI on this PR” and the agent fetches the annotations, the failing spec names, and the surrounding output itself.

In Agent mode with the GitHub MCP server enabled, reference the failing run and let Cursor pull the logs, then propose a quarantine diff for your test config:

The last CI run on this branch failed. Pull the failing job’s logs via the GitHub MCP server, list the specs that fail intermittently (passing on retry), and mark them with test.skip plus a // FLAKY: <run-url> comment. Open a checklist of what you skipped.
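The classification the prompt asks for (fails at least once, passes at least once) is simple enough to check by hand. Here is a minimal sketch under a stated assumption: the input format, one `<spec> <pass|fail>` pair per attempt, is hypothetical, and you would need to produce it from your own test reporter or from the logs the agent fetched.

```shell
#!/usr/bin/env bash
# Sketch only: flaky_specs is a hypothetical helper.
# Input (stdin): one line per test attempt, "<spec-path> <pass|fail>".
# Output: specs that both failed and passed at least once, sorted.
flaky_specs() {
  awk '
    $2 == "pass" { passed[$1] = 1 }
    $2 == "fail" { failed[$1] = 1 }
    END {
      for (spec in failed)
        if (spec in passed) print spec   # failed AND passed => nondeterministic
    }
  ' | sort
}
```

A spec that fails on every attempt is not reported here; that is a real failure, not a quarantine candidate, and comparing this list against the agent's `test.skip` diff is a quick way to catch it skipping too much.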

The Workflow: Progressive Deploy with Auto-Rollback

A binary deploy either fully succeeds or fully fails. A progressive deploy ships the new version to a slice of traffic, watches error rates, and rolls back automatically if they spike. On Cloudflare Workers this is built into wrangler versions — no service mesh required.

Have the agent write a deploy job that uploads a new version, routes 10% of traffic to it, and gates the full rollout on a health check:

# .github/workflows/deploy.yml (excerpt)
      - name: Upload new version
        run: npx wrangler versions upload --json > version.json
      - name: Canary 10% of traffic
        run: |
          # PREV must hold the currently deployed version ID; set it in an earlier step (not shown).
          VID=$(jq -r '.id' version.json)
          npx wrangler versions deploy "$VID@10%" "${PREV}@90%" --yes
      - name: Health gate
        run: ./scripts/check-error-rate.sh  # exits non-zero if 5xx rate > threshold
      - name: Promote to 100%
        if: success()
        run: npx wrangler versions deploy "$(jq -r .id version.json)@100%" --yes
      - name: Rollback on failure
        if: failure()
        run: npx wrangler rollback --message "Auto-rollback: health gate failed"

Write .github/workflows/deploy.yml for a Cloudflare Worker. Use wrangler versions upload, then wrangler versions deploy to send 10% of traffic to the new version, run scripts/check-error-rate.sh as a gate, promote to 100% on success, and wrangler rollback on failure. Pin actions/checkout@v5.
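The workflow calls scripts/check-error-rate.sh without showing it. The threshold comparison at its core might look like the sketch below. The function name `error_rate_ok` and its arguments are assumptions for illustration, and fetching the request and 5xx counts for the canary window is left out because it depends on your metrics source.

```shell
#!/usr/bin/env bash
# Sketch only: error_rate_ok is a hypothetical helper for check-error-rate.sh.
# Usage: error_rate_ok <total_requests> <5xx_count> <max_percent>
# Returns 0 when the 5xx rate is at or below max_percent, 1 otherwise.
error_rate_ok() {
  local total=$1 errors=$2 max_percent=$3
  # An empty sample proves nothing, so fail the gate instead of passing it.
  [ "$total" -gt 0 ] || return 1
  # Integer-only check: errors/total <= max/100  <=>  errors*100 <= max*total
  [ $((errors * 100)) -le $((max_percent * total)) ]
}
```

A script built on this would end with something like `error_rate_ok "$TOTAL" "$ERRORS" 1 || exit 1`, so that a breach fails the Health gate step and triggers the rollback step. Treating zero traffic as a failure is a deliberate choice here: a canary that received no requests has not been validated.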

For agent-driven CI edits, the cross-file reasoning matters more than raw speed — the agent has to read your workspace layout, infer dependencies, and produce valid YAML in one shot. When budget matters less than getting it right in one pass, use Claude Fable 5 (/model fable in Claude Code, or the Cursor model picker) — it is Anthropic’s most capable model and excels at exactly this kind of multi-file work. When budget matters, use Claude Opus 4.8 for the change-detection and progressive-deploy passes and drop to Claude Sonnet 4.6 for the more mechanical flaky-test triage. See model comparison for pricing and a full capability ladder. Codex runs on GPT-5.5 by default across CLI, IDE, and Cloud; use gpt-5.2-codex when you authenticate the CLI with an API key (as the GitHub Action does above).

CI is the one place where a confidently wrong agent edit costs you a broken main. The failure modes are specific.

If an agent-generated workflow fails to parse, don’t ask the agent to “fix the YAML” blind; paste the exact gh workflow view output or Actions error back in. Pinning every action to a major version (@v5, @v3) also limits the classic “it worked yesterday” break, since a major tag should only receive backward-compatible updates. A major tag can still move, so pin to a full commit SHA where you need the reference to be immutable.
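A quick way to check the pinning rule on an agent-written workflow is to list every `uses:` reference that is not pinned. This is a rough sketch, assuming unquoted `uses:` values. `unpinned_actions` is a hypothetical helper that accepts a `vN`-style tag or a 40-character commit SHA as pinned and ignores local `./` actions.

```shell
#!/usr/bin/env bash
# Sketch only: unpinned_actions is a hypothetical helper.
# Reads a workflow file on stdin and prints each `uses:` reference that has
# no @ref or points at something other than a vN tag or a full commit SHA.
unpinned_actions() {
  grep -E '^[[:space:]]*(-[[:space:]]+)?uses:' \
    | sed -E 's/^[[:space:]]*(-[[:space:]]+)?uses:[[:space:]]*//; s/[[:space:]]+#.*$//' \
    | grep -Ev '^\./' \
    | grep -Ev '@(v[0-9]+(\.[0-9]+)*|[0-9a-f]{40})$' \
    || true   # no matches is the good case, so do not propagate grep's exit 1
}
```

Run it as `unpinned_actions < .github/workflows/ci.yml`. Empty output means every external action carries a version tag or SHA, and anything printed (for example a reference ending in `@main`) is worth sending back to the agent.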