AI in CI/CD — agent in the path of every commit

Q13 · Quality gates

Are AI agents part of CI/CD itself (beyond devs running them locally)?

Max-score answer: “Full pipeline: agent generates, agent reviews, agent tests, agent deploys — with hard gates.”

Why it matters: Local-only agents are a productivity story; pipeline agents are an engineering org story. The compounding wins start when AI is in the path of every commit.

Why this matters in 2026

Local coding agents are now common, but adoption varies by organization. Q13 asks whether selected agent workflows also run on the server side—on repository events, schedules, or incident triggers—with auditable permissions and hard gates. Increased code-generation capacity can move the bottleneck toward review and integration; measure that shift in your own PR and CI data instead of assuming a universal throughput multiplier.

Underneath the throughput question is a compounding story. With AI only on the laptop, the org gets N × per_dev_gain. With AI in the pipeline, it gets N × per_dev_gain + pipeline_gain, and pipeline_gain doesn’t depend on which dev remembered to run the agent. Docs stay in sync because a workflow updates them on merge. Lint debt doesn’t accumulate because a scheduled job fixes it nightly. Dependency PRs merge once an agent has verified the upgrade and the required checks pass. That org-level compounding is what the Strategic Leader tier measures.
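To make the arithmetic concrete, here is a small sketch with made-up numbers. The `adoption` factor is an assumption added for illustration (the share of developers who actually run a local agent); with `adoption=1.0` it reduces to the formula in the text. The unit of gain, for example hours saved per week, is also hypothetical.

```python
def org_gain(n_devs: int, per_dev_gain: float,
             adoption: float = 1.0, pipeline_gain: float = 0.0) -> float:
    """N x per_dev_gain (scaled by local adoption) plus pipeline_gain."""
    if not 0.0 <= adoption <= 1.0:
        raise ValueError("adoption must be between 0 and 1")
    return n_devs * adoption * per_dev_gain + pipeline_gain


# Hypothetical inputs: 40 developers, 3 units of gain each,
# half of them remembering to run the agent locally.
local_only = org_gain(40, 3.0, adoption=0.5)
with_pipeline = org_gain(40, 3.0, adoption=0.5, pipeline_gain=25.0)
print(local_only, with_pipeline)
```

The pipeline term is added once for the whole org and is unaffected by `adoption`, which is the asymmetry the paragraph describes.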

The local-vs-pipeline asymmetry also shapes how compliance and security see AI. A local agent is invisible: every change comes from a developer account, so the audit trail looks like 2020. A pipeline agent is the opposite: every action is a workflow run with a logged trigger, token scope, diff, and outcome. That is the AI your security team can actually approve.

What “max score” actually looks like (4-phase pipeline)

The pipeline-agent maturity ladder has four phases. The Strategic Leader tier on Q13 runs all four — and runs them under hard gates that block merge or deploy on real failures.

Phase 1: Generate. Agents create code inside the pipeline, not just review it. Examples: a workflow_dispatch job that takes a Linear ID and opens a draft PR; a scheduled job that pulls ai-eligible tickets overnight; a Sentry-triggered workflow that opens a fix PR when an issue first fires in prod; a Renovate dependency PR augmented by an agent making the call-site changes. Output is a normal PR (branch, commits, body, labels) that flows through the rest of the pipeline.
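One way to picture the generation triggers is as a small router that maps an incoming event to a lane. This is a sketch under assumptions, not a prescribed design: the lane names, the `linear_id` input name, and the `sentry-issue` event type (assumed to be relayed as a `repository_dispatch` event) are hypothetical, and `renovate[bot]` is assumed to be the PR author login used by the hosted Renovate app.

```python
from typing import Any, Optional


def pick_generation_lane(event_name: str, payload: dict[str, Any]) -> Optional[str]:
    """Return the generation lane for an event, or None if no agent should run."""
    if event_name == "workflow_dispatch":
        # Hypothetical input name carrying the Linear ticket ID.
        inputs = payload.get("inputs") or {}
        return "manual-ticket" if inputs.get("linear_id") else None
    if event_name == "schedule":
        return "overnight-backlog"
    if event_name == "repository_dispatch" and payload.get("action") == "sentry-issue":
        # Assumes a relay that forwards new Sentry issues with this event type.
        return "sentry-fix"
    if event_name == "pull_request":
        author = ((payload.get("pull_request") or {}).get("user") or {}).get("login", "")
        if author == "renovate[bot]":
            return "renovate-augment"
    return None
```

Returning `None` for anything unrecognised keeps the default behaviour "no agent runs", so new triggers have to be added deliberately.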

Phase 2: Review. Agents review every PR (human-written or agent-written) before a human looks at it. The layered pattern from Q9: CodeRabbit / Greptile for static analysis; Claude Code Action / Codex Cloud / Cursor Cloud for architecture and intent; Sentry Seer for regression risk. None block merge on their own (that’s what hard gates are for), but together they cover a large share of the line-by-line review surface. The figure of roughly 70% is this guide’s estimate; verify it against your own PR data.

Phase 3: Test. Agents act on test results, not just write tests. They generate missing unit tests for new code paths. They run the E2E suite (Playwright, browser-use, Cypress) and diagnose failures by reading traces. They run security scans (Semgrep, CodeQL, Trivy) and either fix the findings or open a hardening PR. They check performance budgets (Lighthouse, k6) and surface regressions with repro steps. The point isn’t that an agent runs the test, since CI already does that. The point is that an agent triages the failure.

Phase 4: Deploy. Agents own the last mile: merge, deploy, post-deploy verification, rollback on regression. gh pr merge --squash --delete-branch --auto enables auto-merge, so the PR merges once required checks pass; a deploy workflow ships it; a post-deploy job hits the canary, watches error rate and latency, and either promotes to 100% or rolls back. The supervising agent reads Sentry, PostHog, and canary metrics, and on regression it rolls back and opens an incident PR with context.
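The promote-or-rollback decision in the post-deploy job can be sketched as a pure function over two metric snapshots. The thresholds and field names below are hypothetical placeholders; fetching the numbers from Sentry, PostHog, or a canary endpoint is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float   # 95th percentile latency in milliseconds


def canary_decision(baseline: Metrics, canary: Metrics,
                    max_error_delta: float = 0.005,
                    max_latency_ratio: float = 1.2) -> str:
    """Return "promote" or "rollback" by comparing canary to baseline."""
    # Roll back if the canary's error rate rose by more than the allowed delta.
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    # Roll back if p95 latency grew beyond the allowed ratio.
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Keeping the decision free of I/O makes it easy to unit-test and to log alongside the inputs that produced it, which helps when the rollback opens an incident PR.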

Anything weaker — “agents review PRs but nobody runs them in CI,” “Claude Code workflow only fires on @-mention,” “the agent writes code but a human always merges” — caps Q13 at mid-tier.

Current landscape (web-search-verified)

Claude Code GitHub Action

Claude Code GitHub Actions is the official Anthropic action (code.claude.com/docs/en/github-actions) bringing Claude into CI. It can be invoked by supported PR/issue events and runs Claude Code headlessly with repository instructions and permissions bounded by the workflow token. Prompt caching and parallel subagents can reduce repeated work, but savings depend on cache eligibility, token mix, and workflow shape.

For higher-risk repositories, consider runner hardening such as egress controls, runtime audit logs, and workflow-file tamper detection. Choose controls from the repository’s threat model rather than treating one third-party action as universally required.

Codex Cloud (scheduled + on-trigger)

OpenAI Codex Cloud runs in a sandboxed cloud environment instead of plugging directly into CI. Assign tasks from ChatGPT, Codex CLI, or the IDE extension; Codex can branch, run tests, and prepare a PR. For recurring work, Codex tasks and automations in ChatGPT desktop can run saved prompts on a schedule, while event-driven handoffs can use standard GitHub Actions. Measure success, review time, and rework on your own curated backlog; there is no universal success-rate band for unattended fixes.

Cursor Cloud Agents

Cursor’s cloud agent (the IDE-flavoured counterpart to Codex Cloud) extends Cursor’s local agent into background runs: pick a ticket, send it to the cloud, and it opens a PR. The 2026 integration story is that Cursor, Codex, and Claude Code increasingly interoperate: Cursor can connect to Codex over MCP via Composio, and Claude Code can be run from Cursor and vice versa. The CTO implication: don’t commit your whole pipeline to one agent. Run different phases on different agents and let the PR be the integration surface.

Hard gates (security scan, perf budget, cost cap)

Pipeline agents without hard gates are how you end up with a --no-verify-shaped incident in production. Three gates are non-negotiable:

Security gate. Semgrep + CodeQL on every PR; workflow fails on new high-severity findings. Trivy (or Snyk) on every container build; build fails on new high-CVE deps. Pipeline agents run under these gates, not around them.
Performance gate. Lighthouse CI on every preview deploy with tight budgets (e.g. LCP ≤ 2.5s, INP ≤ 200ms, CLS ≤ 0.1). For backend changes, k6 with p95 latency budgets. Failing budget blocks merge.
Cost gate. AI cost cap: give every Claude Code Action workflow a per-run token budget that fails the run if exceeded. Runtime cost cap: infra changes go through a workflow that estimates the new monthly bill (Workers requests, D1 reads, R2 egress, KV ops) and fails if it crosses a threshold without an --accept-cost-increase label.
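As a minimal sketch of how two of these gates might be evaluated once the measurements exist, the functions below check Core Web Vitals against the example budgets above and apply the cost threshold with its label override. The metric keys, function names, and dollar figures are hypothetical; producing the measurements (Lighthouse CI, a billing estimator) is out of scope here.

```python
from typing import Optional

# Example budgets from the text: LCP <= 2.5 s, INP <= 200 ms, CLS <= 0.1.
PERF_BUDGET = {"lcp_ms": 2500.0, "inp_ms": 200.0, "cls": 0.1}
# Label name as written in the text.
COST_OVERRIDE_LABEL = "--accept-cost-increase"


def perf_gate_failures(measured: dict[str, float],
                       budget: Optional[dict[str, float]] = None) -> list[str]:
    """Return one message per violated or missing metric; empty means pass."""
    failures = []
    for metric, limit in (budget or PERF_BUDGET).items():
        value = measured.get(metric)
        if value is None:
            failures.append(f"{metric}: missing measurement")
        elif value > limit:
            failures.append(f"{metric}: {value} exceeds budget {limit}")
    return failures


def cost_gate_passes(current_monthly_usd: float, projected_monthly_usd: float,
                     max_increase_usd: float, labels: set[str]) -> bool:
    """Pass if the increase is within the threshold or explicitly accepted."""
    increase = projected_monthly_usd - current_monthly_usd
    return increase <= max_increase_usd or COST_OVERRIDE_LABEL in labels
```

Treating a missing measurement as a failure is a design choice: a gate that passes when the scanner did not run is not a hard gate.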

Agent generates docs (auto-update READMEs from code changes)

The highest-leverage Phase 1 workflow once the pipeline is in place: on every merge that changes a function signature, public endpoint, env var, CLI flag, or schema, a workflow re-reads the affected files and regenerates the relevant doc sections (README, MIGRATIONS, OpenAPI, CLAUDE.md, Starlight pages). It then commits back to main directly or opens a tiny follow-up PR. The half-life of “docs slightly out of date” can drop from quarters to hours.
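The first step of such a workflow is deciding which docs a merge affects. A minimal sketch, assuming a hand-maintained mapping from path patterns to doc files (the doc file names are hypothetical; the patterns mirror the kind of API-surface paths such a workflow would watch):

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from changed-path patterns to docs that describe them.
DOC_TARGETS = {
    "src/pages/api/**": ["openapi.yaml", "README.md"],
    "src/lib/db/schema.ts": ["MIGRATIONS.md"],
}


def docs_to_regenerate(changed_files: list[str]) -> list[str]:
    """Return the sorted, de-duplicated docs affected by a set of changed files."""
    targets: set[str] = set()
    for path in changed_files:
        for pattern, docs in DOC_TARGETS.items():
            # fnmatchcase lets "*" match across "/" as well, so "**" here
            # behaves like "anything under this directory".
            if fnmatchcase(path, pattern):
                targets.update(docs)
    return sorted(targets)
```

An empty result means the agent step can be skipped entirely, which keeps token spend proportional to real API-surface changes. Note that `fnmatch` semantics differ from a CI system's own path filters, so the two should be kept in sync by hand.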

Agent fixes lint / dependency PRs (Renovate + AI fix)

Renovate-plus-agent is the second-highest-leverage Phase 1 workflow. Renovate opens dependency PRs (you’ve had this for years). What’s new in 2026: wire a Claude Code or Codex workflow to augment those PRs. When Renovate posts the bump, the agent runs the suite, reads failures, makes the call-site changes the new version requires, and commits to Renovate’s branch, so the PR goes green only after the fixes land. The same pattern works for ESLint / Ruff / Prettier rule changes. Lint and dependency debt stop accumulating because fixing is mechanical.
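The run-suite, read-failures, fix, re-run cycle can be sketched as a bounded loop. The two callables are placeholders for whatever actually runs the tests and invokes the agent; their names and signatures are assumptions for illustration, as is the attempt limit.

```python
from typing import Callable


def augment_dependency_pr(run_suite: Callable[[], tuple[bool, str]],
                          apply_agent_fix: Callable[[str], bool],
                          max_attempts: int = 3) -> str:
    """Run tests; on failure let the agent fix and retry, up to max_attempts fixes.

    run_suite returns (passed, log). apply_agent_fix receives the failure log
    and returns False when the agent could not produce a change.
    """
    for attempt in range(max_attempts + 1):
        passed, log = run_suite()
        if passed:
            return "green" if attempt == 0 else "green-after-fix"
        if attempt == max_attempts:
            break  # out of fix attempts; do not loop forever
        if not apply_agent_fix(log):
            return "needs-human"
    return "needs-human"
```

Bounding the loop matters as much as the fix itself: without `max_attempts`, a bump the agent cannot resolve turns into an open-ended token spend.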

Step-by-step: building the AI-in-pipeline workflow

Pick generation triggers first. Before wiring any agent into CI, decide which PR-creating events agents handle. The minimum sensible set: Renovate dependency PRs, Sentry-triggered fix PRs, a scheduled overnight backlog run pulling ai-eligible Linear tickets, and a manual workflow_dispatch lane. Anything more is bonus.
Wire Claude Code GitHub Action for review/fix lanes. Add anthropics/claude-code-action to .github/workflows/claude.yml. Trigger on pull_request, on issue_comment containing @claude, and on labels like needs-fix/ai-review. Pass the repo’s CLAUDE.md into context and set a per-run token budget. For higher-risk repositories, harden the runner ahead of the Claude step, for example with StepSecurity’s harden-runner.
Add Codex Cloud for the overnight generation lane. Point a Codex Cloud environment at the repo. Set up the scheduled run (nightly, weekend) and hand off Sentry-triggered work through a GitHub Actions workflow. Tag Linear tickets ai-eligible. Define a per-task token budget and a sandbox timeout so a stuck task can’t run for 6 hours.
Layer review bots in front of the agent. CodeRabbit (or Greptile) on every PR for static analysis. Claude Code Action review for the architecture pass. Sentry Seer on PRs touching recently-flagged error surfaces. Order matters: cheap fast bots first, expensive architectural agent second.
Install hard gates. Semgrep + CodeQL that fail on new high-severity findings. Lighthouse CI on every preview with explicit Core Web Vitals budgets. Trivy on every container build. A token-budget step in every agent workflow. Mark all four as required checks in branch protection.
Auto-merge with --auto. When the re-check loop settles with no actionable comments and required checks pass, queue gh pr merge --squash --delete-branch --auto. --auto waits for pending checks; it does not override branch protection. Never use --admin from a workflow, because it bypasses those requirements. On failure or requested changes, surface the blocker and stop.
Add post-deploy verification. On every prod deploy, a follow-up workflow hits canary URLs, watches Sentry’s release API for the first 10 minutes, watches PostHog for funnel regressions, and rolls back via wrangler rollback on regression. The rollback opens an incident PR with context.
Add the docs-sync workflow. A paths-filtered workflow that runs on merges touching API surface (e.g. src/pages/api/**, src/lib/db/schema.ts). Claude (headless) regenerates the relevant doc sections and commits back.
Add the Renovate-plus-agent workflow. Fires when Renovate opens a dependency PR, runs the suite, and on failure invokes Claude Code on Renovate’s branch to fix call-site changes. Renovate’s PR is green before a human looks at it.
Observe everything. Pipe workflow telemetry into the AI metrics panel (Q22): token spend per workflow, success/failure rate, time-to-first-comment, time-to-merge, % of PRs touched by an agent, % of agent-touched PRs that merged. Without this you cannot prove ROI to your CFO at year-end.
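The metrics named in the last step can be derived from a flat list of agent run records. The record shape below is a hypothetical minimum, not a schema from any particular telemetry tool; timing metrics such as time-to-merge are omitted to keep the sketch short.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    workflow: str     # e.g. "review", "fix", "docs-sync"
    pr_number: int
    tokens: int
    succeeded: bool
    pr_merged: bool


def summarize(runs: list[AgentRun], total_prs: int) -> dict:
    """Aggregate token spend, success rate, and PR coverage for a period."""
    tokens_by_workflow: dict[str, int] = defaultdict(int)
    for run in runs:
        tokens_by_workflow[run.workflow] += run.tokens
    touched = {run.pr_number for run in runs}
    merged = {run.pr_number for run in runs if run.pr_merged}
    return {
        "tokens_by_workflow": dict(tokens_by_workflow),
        "success_rate": sum(r.succeeded for r in runs) / len(runs) if runs else 0.0,
        "pct_prs_touched": 100 * len(touched) / total_prs if total_prs else 0.0,
        "pct_touched_merged": 100 * len(merged) / len(touched) if touched else 0.0,
    }
```

PRs are counted by distinct `pr_number`, so several agent runs on one PR do not inflate the "touched" percentage.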

Common pitfalls

No hard gates. Adding pipeline agents without security, performance, and cost gates is how you ship a --no-verify incident at scale. Without Semgrep, CodeQL, Lighthouse, and a token cap on every agent workflow, you don’t have a max-score Q13.
Runaway agents. A workflow without a per-run token budget can burn $400 of tokens on one PR. Set a token cap on every agent step, a sandbox timeout on every Codex Cloud task, and a workflow-level concurrency cap so a runaway scheduled run can’t queue twenty copies of itself.
No observability. If you can’t tell me how many PRs an agent touched last week, what they spent, and how many merged, the agent is invisible. Wire workflow telemetry into the AI metrics panel from day one.
Agent merging to main without checks. gh pr merge --auto is the right move; gh pr merge --admin is never the right move from an automated workflow. The whole point of --auto is that it waits for required checks.
Mixing generation and review in one workflow. When the same workflow opens and reviews the PR, the review is structurally compromised. Separate the lanes; share no state beyond the PR itself.
No fallback for agent failure. When Claude Code Action fails (rate-limit, outage, sandbox crash), configure continue-on-error: true on the agent step where appropriate, post a clear “agent did not run, human review required” comment, and don’t block merge on the agent’s absence — block on the hard gates.
Pipeline agents only on one repo. Build the workflows once, vendor them into every repo via a dotfiles-style template. Otherwise scores diverge wildly per project.
No opt-out. A skip-ai label or [skip-ai] commit marker that bypasses agent workflows (but not hard gates) is essential — without it, agents become friction instead of leverage.
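Two of the pitfalls above, runaway token spend and the missing opt-out, reduce to small guards that every agent step can share. This is a sketch under assumptions: how token usage is reported to `charge` depends on the agent runtime, and the class and function names are hypothetical.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when an agent step spends more tokens than its cap."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage and fail fast once the cap is crossed."""
        self.used += tokens
        if self.used > self.limit:
            raise TokenBudgetExceeded(f"used {self.used} of {self.limit} tokens")


def agent_should_run(labels: set[str], commit_message: str) -> bool:
    """Honour the skip-ai label or [skip-ai] commit marker.

    Only agent workflows consult this; hard gates run regardless.
    """
    return "skip-ai" not in labels and "[skip-ai]" not in commit_message
```

Letting the exception propagate fails the workflow step, which is the behaviour a cost gate needs; the opt-out check belongs at the top of agent jobs only, never in the security or performance gates.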