Tier 3 overnight runs — curated backlog + scheduled + review-on-arrival

Scorecard question: Do you run autonomous overnight runs (Codex Cloud, Cursor Cloud Agents)? Max‑score answer (3 pts): Curated “AI eligible” backlog, scheduled runs, review‑on‑arrival in the morning.

Why this matters in 2026

Tier 3 — fully autonomous, fire‑and‑forget — is the most over‑bought and under‑configured capability in the 2026 engineering stack. Every CTO has heard the “Codex Cloud writes the code while you sleep” pitch and most have met the team that turned it on, watched 40 PRs land Wednesday morning, gave up by lunch, and quietly switched it off three weeks later. The capability is real; the default rollout fails.

It fails for one structural reason: overnight runs are a force multiplier and amplify whatever’s already there. If review bandwidth is a bottleneck, Tier 3 turns it into a wall. If the backlog has no label discipline, the agent picks whatever’s on top — usually whatever’s hardest. Teams that score 3 points got there by curating which tickets are eligible, scheduling runs deliberately, and building a morning ritual that absorbs the output.

Being one tier below counts as zero. Launching Cursor Cloud Agents ad hoc, whenever someone feels like it, is closer to Tier 2. Tier 3 means the work happens whether or not anyone presses a button, and a system is designed to catch it. By mid‑2026 the capability is commoditized (Codex Cloud scheduled runs, Cursor Cloud Agents on VMs, Claude Code background agents on your own infra). The discipline around it is the differentiator.

What “max score” actually looks like

The three‑point answer has three load‑bearing pieces: label‑driven curation, nightly cron, and morning triage. Drop any one and you fall to two points.

A curated “AI eligible” backlog. Tickets carry an explicit ai-eligible (or agent-ok) label applied during triage by humans who know the codebase. The label means: small, well‑specified, low‑blast‑radius, reversible. Not schema migrations, security‑sensitive paths, billing, or anything where the spec lives in someone’s head.
A queue feeder, not a firehose. The orchestrator pulls a bounded number of tickets per night — typically 5–15, sized to what review capacity can realistically absorb. Cap per repo. Never “pick the next 50”.
Scheduled runs on a real cron. Codex Cloud scheduled tasks, Cursor Cloud Agents from a nightly GitHub Action, or your own feeder. Fixed hour (22:00–02:00 local), deterministic dispatch so the same ticket isn’t worked twice.
One PR per ticket, with structured metadata. Stable title prefix ([nightly] or [codex-cloud]), source ticket linked, diff scoped to one logical change, self‑test summary attached.
Review‑on‑arrival ritual. A named human (or rotation) owns the queue. By 10:00 every PR is triaged into merge / refine / reject. CI has already run; AI review (CodeRabbit, Greptile, Claude Code Action — see Q9 · PR review automation) has left first‑pass comments.
Enforced review SLA. The rule “Nightly PRs reviewed by 11:00, merged or rejected by EOD” is written down, owned, and watched. PRs that miss it are auto‑closed the next night.
Hard exclusions on sensitive paths. A CODEOWNERS‑style or .agent-exclude file lists paths the agent can’t touch — auth, payments, migrations, infra manifests, /security/. Fails fast if a ticket needs them.
A dashboard with the math. Weekly: PRs opened, % merged, % rejected, % needing rework, mean review time, $ per merged PR. If merged < 60% or cost/PR drifts above $5–8, tighten the rubric — don’t buy more credits.

Concretely: Friday the team queues 12 tickets. Saturday morning the on‑call sees 9 PRs (three blocked by exclusions), CI green on 7, AI reviewer flagged 2; the human merges 6, refines 2, rejects 1 — all before noon.
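
The queue feeder and its caps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any platform's API: the Ticket type, select_nightly_batch, and the cap values (10 per night, 4 per repo) are hypothetical, and the ticket data would come from whatever tracker you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """Hypothetical ticket record; real fields depend on your tracker."""
    id: str
    repo: str
    labels: frozenset[str]


def select_nightly_batch(
    tickets: list[Ticket],
    already_dispatched: set[str],
    nightly_cap: int = 10,   # assumed value, size it to review capacity
    per_repo_cap: int = 4,   # assumed value
    label: str = "ai-eligible",
) -> list[Ticket]:
    """Pick a bounded, deterministic batch of labelled tickets."""
    batch: list[Ticket] = []
    per_repo: dict[str, int] = {}
    # Sorting by ID makes the selection reproducible for the same inputs.
    for ticket in sorted(tickets, key=lambda t: t.id):
        if len(batch) >= nightly_cap:
            break
        if label not in ticket.labels:
            continue  # unlabelled tickets are never picked up
        if ticket.id in already_dispatched:
            continue  # the same ticket is not worked twice
        if per_repo.get(ticket.repo, 0) >= per_repo_cap:
            continue  # cap per repo
        batch.append(ticket)
        per_repo[ticket.repo] = per_repo.get(ticket.repo, 0) + 1
    return batch
```

The design choice worth keeping is that the cap is applied by the feeder, before anything is dispatched, so review load is bounded even when the labelled backlog is large.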

Current landscape (web‑search‑verified)

Codex Cloud (OpenAI’s hosted scheduled agents)

Codex Cloud — the hosted backend behind codex on ChatGPT Plus / Pro / Business / Edu / Enterprise — became the workhorse of overnight pipelines in 2026.

Automations (OpenAI’s term for scheduled background tasks) per repo with a fixed prompt, target branch, and cadence. Pulls open issues by label, dispatches one run per ticket in an isolated background worktree, opens a PR per result. Bills against the org’s Codex quota, not individual seats.
Sandboxed worktrees. Fresh checkout of the repo, your AGENTS.md / CLAUDE.md loaded, a bounded lifetime cap. The agent can run tests and iterate inside the sandbox before pushing.
Stable PR contract. Opened by chatgpt-codex-connector[bot] (the GitHub App handle — confirm yours under your org’s GitHub integration), deterministic title prefix, structured body with source issue link, reasoning summary, and self‑test report.
Quotas matter. Business and Enterprise have explicit parallel‑run caps; Pro is individual‑only. Throttle errors mean you’ve outgrown the tier.

Killer setup: one automation per repo, pointing at ai-eligible, capped at 10 dispatches per night. Most teams underuse the cap — the right failure mode.

Cursor Cloud Agents

Cursor Cloud Agents — autonomous, VM‑based, with self‑test and demo recording — cover the same niche with a different rhythm. Typically launched on demand from the IDE or via the Cursor API, not on a fixed cron, which suits bounded batch runs better than pure overnight pipelines.

VM isolation per task. Isolated environment, your repo, your .cursorrules, your shared rules. Installs deps, runs the test suite, records a video demo, iterates until tests pass.
Merge‑ready PRs with demos. Diff, written summary, and (uniquely) a screen recording of the agent demonstrating the change — the killer review artifact.
Better for “15 tickets done by morning” than “every night drain the queue”. Hybrid: senior IC kicks off a batch end of day, agents run overnight, IC reviews next morning.
Cost shape. Billed by run time and tokens. A batch of 10–15 tickets typically runs $40–80 — cheaper than a senior’s hour.

Most teams: Codex Cloud for the recurring cron, Cursor Cloud Agents for the planned batch.

Anthropic Computer Use / agent automation patterns

Anthropic’s lane in 2026 is less a hosted scheduled product and more a set of primitives for building your own.

Claude Code background agents via the Agent SDK on your own infrastructure — GitHub Action, Cloudflare Worker, EC2 — pulling tickets from a queue you control. You own orchestration; Anthropic provides the runtime.
Computer Use is overkill for code work but excellent for what Codex Cloud and Cursor don’t touch: visual regression triage, screenshot diffs, third‑party portals, smoke tests against staging.
BYO scheduler. Pair Claude Code with Temporal, GitHub Actions, or plain cron and treat the agent as a callable. More control, more glue code.

If you already trust a job runner, BYO on Claude is most flexible. If you want it working tonight, Codex Cloud is shorter.
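
To make "treat the agent as a callable" concrete, here is a minimal dispatcher sketch. It deliberately does not call any real SDK: run_agent is a hypothetical function you would supply (wrapping whichever agent runtime you use), and dispatch_batch and the log record shape are assumptions made for the example.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Callable, Iterable


def dispatch_batch(
    ticket_ids: Iterable[str],
    run_agent: Callable[[str], str],      # hypothetical: ticket ID in, PR reference out
    log: Callable[[str], None] = print,
) -> dict[str, str]:
    """Run one agent call per ticket and log every dispatch as a JSON line."""
    outcomes: dict[str, str] = {}
    for ticket_id in ticket_ids:
        run_id = uuid.uuid4().hex  # local correlation ID, not a platform run ID
        log(json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "ticket_id": ticket_id,
            "run_id": run_id,
        }))
        try:
            outcomes[ticket_id] = run_agent(ticket_id)
        except Exception as exc:  # one failed run must not stop the batch
            outcomes[ticket_id] = f"failed: {exc}"
    return outcomes
```

Because the agent is just a parameter, the same function can be invoked from plain cron, a GitHub Actions job, or a workflow engine, and tested with a fake agent.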

What “AI eligible” looks like

The label is the entire game.

Yes by default: Renovate / Dependabot PRs needing code changes, lint or formatter rollouts, doc updates, type annotations, log standardization, dead‑code removal, data-testid attributes, accessibility lint fixes, tests for uncovered functions, README touch‑ups, error message clarifications.
Maybe, with explicit scoping in the ticket: small refactors with clear before/after, single‑file feature flags, i18n copy changes, minor perf tuning with a benchmark, removing a deprecated API call.
Never eligible: auth, payment, billing, encryption, key management, schema migrations, infra‑as‑code with prod blast radius, build pipeline changes, anything without full acceptance criteria, anything cross‑repo or blocking a release.
Rule of thumb: if it’d take a senior more than 30 minutes, or you can’t state success in one sentence, it’s not ai-eligible.
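
The rubric can also be expressed as a small pre-labelling check. The sketch below is illustrative only: TicketDraft, its fields, and the area tag spellings are hypothetical, and the inputs (area tags, time estimate) are assumed to be filled in by a human during triage.

```python
from dataclasses import dataclass, field

# Area tags mirroring the "never eligible" list; the spellings are assumptions.
NEVER_ELIGIBLE_AREAS = frozenset({
    "auth", "payments", "billing", "encryption", "key-management",
    "schema-migration", "infra-as-code", "build-pipeline",
})


@dataclass
class TicketDraft:
    """Hypothetical triage record filled in by a human during labelling."""
    acceptance_criteria: str
    estimated_senior_minutes: int
    areas: set[str] = field(default_factory=set)
    cross_repo: bool = False
    blocks_release: bool = False


def eligibility_blockers(ticket: TicketDraft) -> list[str]:
    """Return the reasons a ticket must not get the ai-eligible label."""
    reasons: list[str] = []
    forbidden = sorted(ticket.areas & NEVER_ELIGIBLE_AREAS)
    if forbidden:
        reasons.append("touches never-eligible area: " + ", ".join(forbidden))
    if not ticket.acceptance_criteria.strip():
        reasons.append("no acceptance criteria")
    if ticket.estimated_senior_minutes > 30:
        reasons.append("would take a senior more than 30 minutes")
    if ticket.cross_repo:
        reasons.append("cross-repo change")
    if ticket.blocks_release:
        reasons.append("blocks a release")
    return reasons
```

An empty list means nothing in the rubric blocks the label; it does not replace the judgment of the person applying it.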

Review‑on‑arrival workflow (filter, batch, merge)

Three steps, 60‑minute budget for 8–12 PRs:

Filter (10 min). Saved GitHub filter is:pr is:open author:app/chatgpt-codex-connector created:>YYYY-MM-DD, with yesterday’s date filled in. GitHub search addresses app accounts as author:app/<name>, without the [bot] suffix; swap in your own bot handle if your org’s GitHub integration uses a different one. Sort by “no human needed” → “AI flagged” → “CI failed”. A CI‑failed PR is closed immediately unless a retry is already queued.
Batch (20–30 min). Group survivors by code area. Review easiest cluster first to build momentum (docs, lint), then medium. One‑line approvals.
Refine or reject (10–20 min). One round‑trip: push a small fix and merge, or leave a precise comment and re‑run. Still failing after one cycle → rejected and re‑filed as a human ticket.

Over 90 minutes means curation is too loose.
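
The filter and batch steps amount to a sort plus a grouping, sketched below. The NightlyPR fields and the reading of the three buckets are assumptions: "no human needed" is taken to mean CI green with no AI reviewer flags, and the data would come from your own CI and review tooling.

```python
from dataclasses import dataclass
from enum import IntEnum


class Bucket(IntEnum):
    NO_HUMAN_NEEDED = 0  # assumed meaning: CI green, AI reviewer raised nothing
    AI_FLAGGED = 1       # CI green, AI reviewer left flags
    CI_FAILED = 2


@dataclass
class NightlyPR:
    """Hypothetical summary of one nightly PR."""
    number: int
    area: str            # code area used for batching, e.g. "docs"
    ci_passed: bool
    ai_flags: int
    retry_queued: bool = False


def bucket_for(pr: NightlyPR) -> Bucket:
    if not pr.ci_passed:
        return Bucket.CI_FAILED
    return Bucket.AI_FLAGGED if pr.ai_flags > 0 else Bucket.NO_HUMAN_NEEDED


def triage_order(prs: list[NightlyPR]) -> tuple[list[NightlyPR], list[NightlyPR]]:
    """Split PRs into (review queue, close immediately)."""
    # CI-failed PRs are closed unless another iteration is already queued.
    to_close = [p for p in prs if bucket_for(p) == Bucket.CI_FAILED and not p.retry_queued]
    closing = {p.number for p in to_close}
    # Within a bucket, sorting by area keeps PRs from the same area adjacent.
    to_review = sorted(
        (p for p in prs if p.number not in closing),
        key=lambda p: (bucket_for(p), p.area, p.number),
    )
    return to_review, to_close
```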

Copy-paste scoping block for AGENTS.md / CLAUDE.md (this is what keeps nightly PRs reviewable):

Rules for autonomous overnight runs

Keep every PR scoped to exactly one logical change. Never refactor, reformat, or “tidy” neighbouring files, even if they look wrong — open a separate ticket instead.

If the ticket cannot be completed without touching auth, payments, billing, encryption, key management, schema migrations, infra-as-code, or any path under /security/, stop and open no PR. Comment on the source issue explaining why it was skipped.

Run the full test suite before pushing. If you add behaviour, add a test for it. If tests fail after one self-correction pass, push the branch as a draft and label it needs-human.

Title every PR [nightly] <imperative summary> and link the source ticket in the body. Include a short self-test report: what you ran, what passed, what you could not verify.

Prefer the smallest diff that satisfies the acceptance criteria. A 60-line PR that a human merges in two minutes beats a 600-line PR that sits for a week.

This is the single highest-leverage artifact for Tier 3: it converts “the agent did something overnight” into “the agent did one reviewable thing per ticket.”
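
The PR contract in the scoping block can be checked mechanically before a human looks. The sketch below is an assumption-laden example: contract_violations is a hypothetical helper, the issue-link pattern assumes GitHub-style references (adapt it for Linear or Jira keys), and the 400-line ceiling is an arbitrary illustrative threshold.

```python
import re

TITLE_RE = re.compile(r"^\[nightly\] \S.*")
# Assumes GitHub-style references such as "#123" or a full issue URL.
ISSUE_LINK_RE = re.compile(r"#\d+|https://github\.com/[\w.-]+/[\w.-]+/issues/\d+")


def contract_violations(
    title: str,
    body: str,
    changed_lines: int,
    max_changed_lines: int = 400,  # assumed ceiling, tune per repo
) -> list[str]:
    """Return the ways a nightly PR breaks the title/body/size contract."""
    problems: list[str] = []
    if not TITLE_RE.match(title):
        problems.append("title must start with '[nightly] '")
    if not ISSUE_LINK_RE.search(body):
        problems.append("body must link the source ticket")
    if "self-test" not in body.lower():
        problems.append("body must include a self-test report")
    if changed_lines > max_changed_lines:
        problems.append(f"diff exceeds {max_changed_lines} changed lines")
    return problems
```

Run as a CI step, a check like this lets the morning reviewer skip PRs that already fail the contract.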

Step‑by‑step: launching nightly runs

Pick the platform first. Codex Cloud if you’re already on ChatGPT Business / Enterprise and want minimum setup. Cursor Cloud Agents for the video demo artifact and bounded batches. Claude Code + your own scheduler if you have a strong DevEx team and want full control. Don’t run more than one in the first 90 days.
Define the ai-eligible rubric. One page in the handbook with the “yes / maybe / never” lists from above, adapted to your codebase. Get sign‑off from at least one senior per major area before publishing. Add the label to Linear / Jira / GitHub Issues.
Pilot with one repo, one label, ten tickets. Pick a low‑stakes repo. Label exactly ten tickets. Don’t schedule yet — trigger a one‑off batch manually and walk through the PRs with the team next morning. What surprised you is the curation gap.
Tune the rubric based on the pilot. If 3 of 10 PRs were “not what we meant”, the rubric has holes. Tighten the language, expand “not eligible”, re‑train triage. Repeat once with fresh tickets. Only when ≥80% are merge‑ready on first review do you move to a schedule.
Wire up the scheduler. Codex: an automation per repo with a label filter and a 10‑dispatch cap. Cursor: nightly GitHub Action querying your tracker and dispatching Cloud Agents up to the cap. Claude Code: your own job runner. Log every dispatch with timestamp, ticket ID, run ID for later correlation.
Add hard exclusions. A .agent-exclude file (or extend CODEOWNERS) listing paths the agent must not touch — auth, payments, billing, migrations, infra manifests, /security/, anything matching *secrets*. The agent should detect a forbidden path and skip the ticket without opening a PR. Test with a deliberate “trap” ticket.
Stand up the morning ritual. Named owner (or rotation). Block 09:00–10:00 in the calendar for “Nightly PR triage”. Save the GitHub filter, document the merge / refine / reject heuristics, write the SLA: every nightly PR has a decision by 11:00. Auto‑close PRs older than 24 hours.
Wire AI review into the same PRs. CodeRabbit, Greptile, Claude Code Action, or Codex Review running on every nightly PR before the human looks. Critical flags = dead on arrival; clean = fast human read. (See Q9 · PR review automation.)
Add the dashboard. Weekly: PRs opened, % merged, % rejected, % needing rework, mean review time, $ spend, mean PR size. Pull from the GitHub API and platform billing into a sheet or Grafana panel. If merged < 60% two weeks running, freeze the schedule and re‑tune.
Expand carefully. 30 days stable on repo #1 → add repo #2. 60 days → consider raising the cap from 10 to 15. 90 days → evaluate a second platform. Each expansion adds load to the morning ritual; don’t expand unless the ritual still fits in 60 minutes.
Calendar a quarterly retrospective. Same cadence as your tooling policy review (Q2). Publish deltas; update the rubric.
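
The hard-exclusion step can be sketched as a path check that runs before any PR is opened. The .agent-exclude format here is an assumption (one glob per line, # for comments), and matching uses Python's fnmatch, where * also matches "/"; this is simpler than, and not identical to, CODEOWNERS or gitignore semantics.

```python
from fnmatch import fnmatchcase


def load_patterns(text: str) -> list[str]:
    """Parse an assumed .agent-exclude format: one glob per line, '#' comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def forbidden_paths(paths: list[str], patterns: list[str]) -> list[str]:
    """Return the touched paths that match any exclusion pattern."""
    # fnmatchcase is case-sensitive on every platform, so results are stable.
    return [p for p in paths if any(fnmatchcase(p, pat) for pat in patterns)]


def should_skip_ticket(paths: list[str], patterns: list[str]) -> bool:
    """True when the run touched a forbidden path and no PR should be opened."""
    return bool(forbidden_paths(paths, patterns))
```

A deliberate "trap" ticket that requires editing an excluded path is a cheap way to confirm the check fires before review rather than during it.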

Common pitfalls

No curation — agent picks anything. You point Codex at “open issues” and it picks the most senior‑sounding ticket because it has the most context. Three PRs in, you’ve eaten review bandwidth on the wrong problem. Never run unfiltered.
No review SLA. Nightly PRs pile up because nobody owns them. By Friday there are 60 stale bot PRs and someone closes them all in frustration. Write the SLA, name the owner, auto‑close after 24 hours.
Runs touching sensitive code paths. The agent edits auth middleware to add a docstring and adjusts a redirect URL while it’s there. You only catch it because a customer notices. Enforce hard exclusions before the agent runs.
Treating Tier 3 as Tier 2 + cron. Running interactive agents on a schedule isn’t Tier 3; it’s Tier 2 with more PRs. Tier 3 requires the curation discipline and the morning ritual.
Mega‑PRs. The agent fixes one thing and “while it’s there” refactors a neighbouring file. 800 lines, mixed concerns. Instruct the agent in AGENTS.md / CLAUDE.md to keep PRs scoped; reject everything else on sight.
No cost dashboard. You don’t know nightly runs cost $14/PR until the monthly invoice. Instrument cost per merged PR week one.
Ignoring the AI reviewer’s output. The human skims the diff but doesn’t read what CodeRabbit flagged. The layer is meant to shrink the human’s job; ignoring it doubles the load.
Forgetting the off switch. Holidays, code freezes, release weeks — nobody’s reviewing but cron is still firing. One env var on the dispatcher; on‑call toggles it.
Buying credits before building the ritual. “We bought 200 nightly runs of Codex quota” and there’s no morning routine. Ritual before quota. Always.

How to verify you’re there

A new engineer can describe the ai-eligible rubric in one paragraph from the handbook, including at least three “not eligible” categories.
In two clicks you can show a saved GitHub filter returning last night’s PRs with CI / AI review status.
The morning ritual has a named owner, a calendared slot, and a written SLA with auto‑close fallback. Slack pings the owner if SLA slips.
Hard exclusions are enforced at dispatch, not review. Provable with a “trap” ticket the agent declined.
The weekly dashboard tracks nightly PR throughput, merge %, rework %, and $/merged‑PR. All four have trended in the right direction over the last 4 weeks.
Merge rate is ≥60%. If it dropped lower, you paused the schedule and tightened curation, and the change history shows it.
The AI reviewer (Q9) runs on every nightly PR before human triage.
An engineer returning from PTO finds no backlog of stale bot PRs, because auto‑close kept the queue clean.
Cost per merged PR is bounded and tracked. Anomalies (single PR > $50) are investigated.
A senior IC says, unprompted, the nightly pipeline “actually helps” — less time on dependency bumps and lint than six months ago.