AI in CI/CD — agent in the path of every commit

Q13 · Quality gates: Are AI agents part of CI/CD itself (beyond devs running them locally)?

Max-score answer: “Full pipeline: agent generates, agent reviews, agent tests, agent deploys — with hard gates.”

Why it matters: Local-only agents are a productivity story; pipeline agents are an engineering org story. The compounding wins start when AI is in the path of every commit.

In 2026 every serious engineering org has agents on developer laptops. That part is settled. What separates a Coordinated org from a Strategic Leader on Q13 is whether those agents also live on the server side: running on push, on pull_request, on schedule (cron), on workflow_dispatch, on Sentry alerts, and on CodeRabbit triggers. The AI Engineering Report 2026 (telemetry from 22,000 developers across 4,000 teams) found median time in PR review up 441% year over year. The reason: local agents quadrupled PR throughput while CI capacity stayed the same. The bottleneck moved from “writing the code” to “everything between push and merge.” Pipeline agents are the only thing that scales that side too.

Underneath the throughput number is a compounding story. With AI only on the laptop, the org gets N × per_dev_gain. With AI in the pipeline, it gets N × per_dev_gain + pipeline_gain — and pipeline_gain doesn’t depend on which dev remembered to run the agent. Docs stay in sync because a workflow updates them on merge. Lint debt doesn’t accumulate because a schedule job fixes it nightly. Dependency PRs are merged by an agent that verified the upgrade. That org-level compounding is what the Strategic Leader tier measures.

The local-vs-pipeline asymmetry also shapes how compliance and security see AI. A local agent is invisible — every change comes from a developer account, the audit trail looks like 2020. A pipeline agent is the opposite: every action is a workflow run with a logged trigger, token scope, diff, and outcome. That is the AI your security team can actually approve.

What “max score” actually looks like (4-phase pipeline)

The pipeline-agent maturity ladder has four phases. The Strategic Leader tier on Q13 runs all four — and runs them under hard gates that block merge or deploy on real failures.

Phase 1 — Generate. Agents create code inside the pipeline, not just review it. Examples: a workflow_dispatch job that takes a Linear ID and opens a draft PR; a schedule job that pulls ai-eligible tickets overnight; a Sentry-triggered workflow that opens a fix PR when an issue first fires in prod; a Renovate dependency PR augmented by an agent making the call-site changes. Output is a normal PR — branch, commits, body, labels — that flows through the rest of the pipeline.

Phase 2 — Review. Agents review every PR (human-written or agent-written) before a human looks at it. The layered pattern from Q9: CodeRabbit / Greptile for static analysis; Claude Code Action / Codex Cloud / Cursor Cloud for architecture and intent; Sentry Seer for regression risk. None block merge on their own — that’s what hard gates are for — but together they cover roughly 70% of the line-by-line review surface.

Phase 3 — Test. Agents run the tests, not just write them. They generate missing unit tests for new code paths. They run the E2E suite (Playwright, browser-use, Cypress) and diagnose failures by reading traces. They run security scans (Semgrep, CodeQL, Trivy) and either fix or open a hardening PR. They run performance budgets (Lighthouse, k6) and surface regressions with repro steps. The point isn’t that an agent runs the test — CI already does. The point is that an agent triages the failure.

Phase 4 — Deploy. Agents own the last mile: merge, deploy, post-deploy verification, rollback on regression. gh pr merge --squash --delete-branch --auto queues the merge once required checks pass; a deploy workflow ships it; a post-deploy job hits canary, watches error rate and latency, and either promotes to 100% or rolls back. The supervising agent reads Sentry, PostHog, and canary metrics. On regression it rolls back and opens an incident PR with context.
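The promote-or-rollback decision in Phase 4 can be sketched as a pure function over canary and baseline metrics. This is a minimal illustration, not a prescribed implementation: the `CanarySnapshot` type, the function name, and both thresholds are hypothetical, and a real supervising agent would pull the numbers from Sentry, PostHog, or the canary itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanarySnapshot:
    """Metrics sampled over the same window (hypothetical shape)."""
    error_rate: float      # fraction of failed requests, e.g. 0.002
    p95_latency_ms: float


def canary_decision(
    canary: CanarySnapshot,
    baseline: CanarySnapshot,
    max_error_rate_increase: float = 0.005,  # assumed: +0.5 percentage points
    max_latency_ratio: float = 1.2,          # assumed: 20% slower than baseline
) -> str:
    """Return "rollback" on a regression, otherwise "promote"."""
    if canary.error_rate - baseline.error_rate > max_error_rate_increase:
        return "rollback"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Keeping the decision free of I/O makes it easy to unit-test and to log alongside the incident PR when a rollback fires.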

Anything weaker — “agents review PRs but nobody runs them in CI,” “Claude Code workflow only fires on @-mention,” “the agent writes code but a human always merges” — caps Q13 at mid-tier.

Claude Code GitHub Actions is the official Anthropic action (code.claude.com/docs/en/github-actions) that brings Claude into your workflow as a first-class CI agent. It is invoked by @claude mentions in PR/issue comments, by labels (needs-fix, ai-review), or by schedule and workflow_dispatch triggers. Under the hood it runs the same Claude Code CLI you use locally (same CLAUDE.md, same skills, same hooks) in headless mode with -p, so your local context follows the agent into the pipeline. The 2026 build supports prompt caching across workflow runs (≈90% cost reduction on the second run), parallel sub-agents via the Claude Agent SDK, and a managed permission model where the workflow’s GitHub token bounds what the agent can touch.

Serious shops harden the action with StepSecurity’s harden-runner — egress firewall, runtime audit logs, tamper detection on workflow files. Claude Code itself operates without network restrictions out of the box, which is why the firewall is essential.

OpenAI Codex Cloud runs its own sandboxed cloud environment instead of plugging into your CI. Assign tasks from ChatGPT, Codex CLI, or the IDE extension; Codex branches, commits, runs tests in the sandbox, and opens a PR back into GitHub. The 2026 build (codex remote-control, multi-environment view_image, Bedrock AWS auth) added schedule (cron) and on-trigger (webhook — Sentry alert, GitHub event). Reported success rate on well-scoped fix-with-clear-repro tickets is 85–90% — the band where you can let it run overnight against a curated backlog without the morning being a graveyard of broken PRs.

Cursor’s cloud agent (the IDE-flavoured counterpart to Codex Cloud) extends Cursor’s local agent into background runs: pick a ticket, send it to the cloud, and it opens a PR. The 2026 integration story is that Cursor, Codex, and Claude Code increasingly interoperate: Cursor can reach Codex over MCP via Composio, and Claude Code can be driven from Cursor and vice versa. The CTO implication: don’t commit your whole pipeline to a single agent. Run different phases on different agents and let the PR be the integration surface.

Hard gates (security scan, perf budget, cost cap)

Pipeline agents without hard gates are how you end up with a --no-verify-shaped incident in production. Three gates are non-negotiable:

  • Security gate. Semgrep + CodeQL on every PR; workflow fails on new high-severity findings. Trivy (or Snyk) on every container build; build fails on new high-CVE deps. Pipeline agents run under these gates, not around them.
  • Performance gate. Lighthouse CI on every preview deploy with tight budgets (e.g. LCP ≤ 2.5s, INP ≤ 200ms, CLS ≤ 0.1). For backend changes, k6 with p95 latency budgets. Failing budget blocks merge.
  • Cost gate. Two caps apply. AI cost cap: every Claude Code Action workflow gets a per-run token budget, and the run fails if the budget is exceeded. Runtime cost cap: infra changes go through a workflow that estimates the new monthly bill (Workers requests, D1 reads, R2 egress, KV ops) and fails if it crosses a threshold without an --accept-cost-increase label.
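As a rough sketch, the three gates above reduce to checks that collect failures and turn them into a process exit code, which is what makes a CI step block merge. The Core Web Vitals budgets and the --accept-cost-increase label come from the text; every function name, parameter, and input shape here is a hypothetical stand-in for whatever your scanners and cost estimator actually emit.

```python
# Core Web Vitals budgets quoted in the performance gate above.
WEB_VITALS_BUDGET = {"lcp_ms": 2500, "inp_ms": 200, "cls": 0.1}


def security_failures(new_high_severity_findings: int) -> list[str]:
    """Fail on any new high-severity finding (count supplied by the scanners)."""
    if new_high_severity_findings > 0:
        return [f"security: {new_high_severity_findings} new high-severity finding(s)"]
    return []


def performance_failures(measured: dict[str, float]) -> list[str]:
    """List every metric that exceeds its budget."""
    return [
        f"{name}: {measured[name]} > {limit}"
        for name, limit in WEB_VITALS_BUDGET.items()
        if measured[name] > limit
    ]


def cost_failures(
    tokens_used: int,
    token_budget: int,
    est_monthly_usd: float,
    monthly_threshold_usd: float,
    labels: set[str],
) -> list[str]:
    """Check the AI token cap and the runtime cost cap (thresholds are assumptions)."""
    failures = []
    if tokens_used > token_budget:
        failures.append(f"tokens: {tokens_used} > {token_budget}")
    over_threshold = est_monthly_usd > monthly_threshold_usd
    if over_threshold and "--accept-cost-increase" not in labels:
        failures.append(f"monthly cost: {est_monthly_usd} > {monthly_threshold_usd}")
    return failures


def gate_exit_code(failures: list[str]) -> int:
    """Print each failure and return 1 if any gate failed, else 0."""
    for failure in failures:
        print(f"GATE FAILED: {failure}")
    return 1 if failures else 0
```

A CI step would finish with `sys.exit(gate_exit_code(...))`, so a non-zero status fails the required check regardless of whether a human or an agent opened the PR.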

Agent generates docs (auto-update READMEs from code changes)

The highest-leverage Phase 1 workflow once the pipeline is in place: on every merge that changes a function signature, public endpoint, env var, CLI flag, or schema, a workflow re-reads the affected files and regenerates the relevant doc sections — README, MIGRATIONS, OpenAPI, CLAUDE.md, Starlight pages. Commit back to main directly or open a tiny follow-up PR. Half-life of “docs slightly out of date” drops from quarters to hours.
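One way to decide whether a merge should trigger the docs-sync workflow is to filter the changed file list against a set of API-surface patterns. The sketch below assumes two example patterns (`src/pages/api/**` and `src/lib/db/schema.ts`) and a hypothetical function name; adjust both to your repo layout.

```python
from fnmatch import fnmatchcase

# Example API-surface patterns; these are assumptions about the repo layout.
API_SURFACE_PATTERNS = ("src/pages/api/**", "src/lib/db/schema.ts")


def api_surface_changes(
    changed_files: list[str],
    patterns: tuple[str, ...] = API_SURFACE_PATTERNS,
) -> list[str]:
    """Return the changed files that match any API-surface pattern.

    fnmatch's "*" also matches "/", so "**" here behaves like a
    recursive wildcard rather than a true glob-star.
    """
    return [
        path
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]
```

The input could come from `git diff --name-only` over the merged range; an empty result means the workflow can exit early without invoking the agent or spending tokens.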

Agent fixes lint / dependency PRs (Renovate + AI fix)

Renovate-plus-agent is the second-highest-leverage Phase 1 workflow. Renovate opens dependency PRs (you’ve had this for years). What’s new in 2026: wire a Claude Code or Codex workflow to augment those PRs — when Renovate posts the bump, the agent runs the suite, reads failures, makes the call-site changes the new version requires, commits to Renovate’s branch, and only then turns the PR green. Same pattern for ESLint / Ruff / Prettier rule changes. Lint and dependency debt stop accumulating because fixing is mechanical.
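The run, read, fix, re-run loop described above can be sketched as a bounded retry. Both callables are hypothetical placeholders: `run_suite` would wrap your test command, and `attempt_fix` would invoke the agent on Renovate’s branch. The attempt limit is an assumption added so a stubborn upgrade cannot loop forever.

```python
from typing import Callable


def augment_dependency_pr(
    run_suite: Callable[[], bool],
    attempt_fix: Callable[[], None],
    max_attempts: int = 3,  # assumed cap on agent fix attempts
) -> str:
    """Run the suite; on failure let the agent fix and retry.

    Returns "green" once the suite passes, or "needs-human" when the
    suite still fails after max_attempts fix attempts.
    """
    for _ in range(max_attempts):
        if run_suite():
            return "green"
        attempt_fix()
    # One final run to verify the last fix attempt.
    return "green" if run_suite() else "needs-human"
```

Returning an explicit "needs-human" state lets the workflow label the PR and stop, rather than leaving a red Renovate PR with no explanation.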

Step-by-step: building the AI-in-pipeline workflow

  1. Pick generation triggers first. Before wiring any agent into CI, decide which PR-creating events agents handle. The minimum sensible set: Renovate dependency PRs, Sentry-triggered fix PRs, a scheduled overnight backlog run pulling ai-eligible Linear tickets, and a manual workflow_dispatch lane. Anything more is a bonus.

  2. Wire Claude Code GitHub Action for review/fix lanes. Add anthropics/claude-code-action to .github/workflows/claude.yml. Trigger on pull_request, on issue_comment containing @claude, and on labels like needs-fix/ai-review. Pass the repo’s CLAUDE.md into context, set a per-run token budget, and harden the runner with StepSecurity’s harden-runner ahead of the Claude step.

  3. Add Codex Cloud for the overnight generation lane. Point a Codex Cloud environment at the repo. Set up the schedule trigger (nightly, weekend) and the on-trigger webhook from Sentry. Tag Linear tickets ai-eligible. Define a per-task token budget and a sandbox timeout so a stuck task can’t run for 6 hours.

  4. Layer review bots in front of the agent. CodeRabbit (or Greptile) on every PR for static analysis. Claude Code Action review for the architecture pass. Sentry Seer on PRs touching recently-flagged error surfaces. Order matters: cheap fast bots first, expensive architectural agent second.

  5. Install hard gates. Semgrep + CodeQL that fail on new high-severity findings. Lighthouse CI on every preview with explicit Core Web Vitals budgets. Trivy on every container build. A token-budget step in every agent workflow. Mark all four as required checks in branch protection.

  6. Auto-merge with --auto. When the re-check loop settles with no actionable comments and required checks pass, queue gh pr merge --squash --delete-branch --auto. --auto waits for pending checks; it does not override branch protection. Never pass --admin from a workflow, since that flag bypasses the required checks. On failure or requested changes, surface the blocker and stop.

  7. Add post-deploy verification. On every prod deploy, a follow-up workflow hits canary URLs, watches Sentry’s release API for the first 10 minutes, watches PostHog for funnel regressions, and rolls back via wrangler rollback on regression. The rollback opens an incident PR with context.

  8. Add the docs-sync workflow. A paths-filtered workflow that runs on merges touching API surface (e.g. src/pages/api/**, src/lib/db/schema.ts). Claude (headless) regenerates the relevant doc sections and commits back.

  9. Add the Renovate-plus-agent workflow. Fires when Renovate opens a dependency PR, runs the suite, and on failure invokes Claude Code on Renovate’s branch to fix call-site changes. Renovate’s PR is green before a human looks at it.

  10. Observe everything. Pipe workflow telemetry into the AI metrics panel (Q22): token spend per workflow, success/failure rate, time-to-first-comment, time-to-merge, % of PRs touched by an agent, % of agent-touched PRs that merged. Without this you cannot prove ROI to your CFO at year-end.
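The metrics named in step 10 can be computed from a flat list of workflow-run records. This is a minimal sketch under assumed inputs: the `AgentRun` shape and both function names are hypothetical, and the real records would come from whatever your CI exports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """One agent workflow run (hypothetical record shape)."""
    workflow: str
    tokens: int
    succeeded: bool


def summarize_runs(runs: list[AgentRun]) -> dict[str, dict[str, float]]:
    """Aggregate run count, token spend, and success rate per workflow."""
    summary: dict[str, dict[str, float]] = {}
    for run in runs:
        row = summary.setdefault(
            run.workflow, {"runs": 0, "tokens": 0, "successes": 0}
        )
        row["runs"] += 1
        row["tokens"] += run.tokens
        row["successes"] += int(run.succeeded)
    for row in summary.values():
        row["success_rate"] = row["successes"] / row["runs"]
    return summary


def agent_pr_share(
    total_prs: int, agent_touched: int, agent_touched_merged: int
) -> dict[str, float]:
    """Percent of PRs touched by an agent, and percent of those that merged."""
    return {
        "pct_prs_touched": 100 * agent_touched / total_prs if total_prs else 0.0,
        "pct_agent_prs_merged": (
            100 * agent_touched_merged / agent_touched if agent_touched else 0.0
        ),
    }
```

Computing these from raw run records, rather than from agent self-reports, keeps the panel honest when an agent step is skipped or fails silently.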

  • No hard gates. Adding pipeline agents without security, performance, and cost gates is how you ship a --no-verify incident at scale. Without Semgrep, CodeQL, Lighthouse, and a token cap on every agent workflow, you don’t have a max-score Q13.
  • Runaway agents. A workflow without a per-run token budget can burn $400 of tokens on one PR. Set a token cap on every agent step, a sandbox timeout on every Codex Cloud task, and a workflow-level concurrency cap so a runaway scheduled run can’t queue twenty copies of itself.
  • No observability. If you can’t tell me how many PRs an agent touched last week, what they spent, and how many merged, the agent is invisible. Wire workflow telemetry into the AI metrics panel from day one.
  • Agent merging to main without checks. gh pr merge --auto is the right move; gh pr merge --admin is never the right move from an automated workflow. The whole point of --auto is that it waits for required checks.
  • Mixing generation and review in one workflow. When the same workflow opens and reviews the PR, the review is structurally compromised. Separate the lanes; share no state beyond the PR itself.
  • No fallback for agent failure. When Claude Code Action fails (rate-limit, outage, sandbox crash), configure continue-on-error: true on the agent step where appropriate, post a clear “agent did not run, human review required” comment, and don’t block merge on the agent’s absence — block on the hard gates.
  • Pipeline agents only on one repo. Build the workflows once, vendor them into every repo via a dotfiles-style template. Otherwise scores diverge wildly per project.
  • No opt-out. A skip-ai label or [skip-ai] commit marker that bypasses agent workflows (but not hard gates) is essential — without it, agents become friction instead of leverage.
  • Agents create PRs in CI (not just on laptops) — minimum: Renovate-augmenting, Sentry-triggered, scheduled overnight backlog.
  • Every PR (human or agent) is reviewed by at least two of CodeRabbit/Greptile, Claude Code Action, Codex Cloud, Sentry Seer before a human reads it.
  • Semgrep + CodeQL + Trivy block merge on new high-severity findings.
  • Lighthouse CI on every preview with explicit Core Web Vitals budgets blocks merge on regression.
  • Every agent workflow has a per-run token budget that fails the run if exceeded.
  • gh pr merge --auto --squash --delete-branch is queued automatically once the re-check loop settles. Never pass --admin from a workflow.
  • Post-deploy verification watches Sentry releases, PostHog funnels, and canary metrics, and rolls back on regression.
  • A docs-sync workflow regenerates docs when API surface changes — README, OpenAPI, CLAUDE.md.
  • Renovate PRs are augmented by an agent that fixes call-site changes before the PR turns green.
  • Your AI metrics panel (Q22) shows per-workflow token spend, success rate, and PR throughput — and your CFO has seen it.
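The auto-merge rule from the checklist can be sketched as a guard that either builds the gh command or returns nothing. The function name and its inputs are hypothetical; the flags are the ones named in the text. Pending checks are allowed because --auto waits for them, while failed checks or actionable comments stop the merge.

```python
from typing import Optional


def merge_command(
    pr_number: int, failed_checks: int, actionable_comments: int
) -> Optional[list[str]]:
    """Return the gh argv that queues auto-merge, or None when blocked.

    --admin is deliberately never emitted: an automated workflow
    should not bypass required checks.
    """
    if failed_checks > 0 or actionable_comments > 0:
        return None
    return [
        "gh", "pr", "merge", str(pr_number),
        "--squash", "--delete-branch", "--auto",
    ]
```

A workflow step would pass the returned list to `subprocess.run(..., check=True)` and, when the result is None, post the blocker as a PR comment and stop.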