Cost visibility — full FinOps for token spend
Scorecard question: What can you see about costs (token spend per dev, per repo, per PR)? Max‑score answer (3 pts): Full FinOps: per‑dev + per‑repo + per‑PR + alerts + model tagging.
Why this matters in 2026 (refactor cost variability)
Token spend is the new compute spend. The difference is that compute spend, even in the cloud era, was relatively well‑behaved per unit of work: an EC2 instance ran for a measurable hour, a Lambda invocation had a known runtime, and the worst‑case variance between “cheap” and “expensive” was usually within an order of magnitude. Token spend is not like that. The cost of a single refactor depends on which model was selected (Haiku vs Sonnet vs Opus), how big the context window grew, whether prompt caching hit or missed, how many tool calls the agent decided to fire, and whether someone left a background agent running through lunch. Without per‑PR and per‑repo visibility you cannot tell whether a refactor cost $5 or $500, and increasingly the variance between the two is the thing that wrecks budgets.
The FinOps Foundation’s 2026 State of FinOps report finds that 98% of respondents now manage AI costs, up from 31% in 2024, but AI cost visibility remains the top challenge across the 1,192 organizations surveyed. The dominant failure mode is what finance teams call “the consolidated invoice problem”: Anthropic, OpenAI, Cursor, and Copilot each send a single monthly line item to accounts payable (AP), the number trends materially upward every quarter, and nobody can allocate it to the cost centers responsible. The CFO sees the bill grow 4× in a year, asks engineering to justify it, and engineering’s best answer is “the agents are doing more”. That answer stops working at the point where AI tooling is your second or third largest engineering line item, which, for most teams with 20+ engineers on Claude Code or Cursor Ultra, is exactly where you are in 2026.
If you scored 0 or 1 on Q4, you have an invoice‑level view: a monthly aggregate per vendor, maybe per seat. That is not FinOps. That is bookkeeping. Three points means you can answer the question every CFO, audit committee, and FP&A partner will eventually ask: “which 10% of our engineers, repos, or PRs is driving 50% of the spend, and is that spend producing proportional output?” If you cannot answer it on a Tuesday afternoon in under five minutes, you do not have cost visibility — you have a hope and an invoice.
What “max score” actually looks like
A real “full FinOps” dashboard view for AI coding spend in 2026 looks like this, queryable on demand and refreshed at least daily:
- Per‑developer panel. Token spend by engineer for the trailing 7/30/90 days, broken down by model (Sonnet, Opus, GPT‑5, Haiku) and by surface (Claude Code CLI, Cursor Agent, Codex CLI, Copilot). Each row links to the engineer’s recent sessions or PRs so the spend can be inspected, not just totalled. You can identify the top quintile of spenders and the bottom quintile, and you treat both as signal (top spenders may be your highest throughput developers; bottom spenders may not be using the tools at all).
- Per‑repo panel. Spend by repository for the same time slices, with cost per merged PR as the headline metric. Visible repos are tagged by criticality (production, internal, experimental) so the dashboard can show “$X/week is going into experimental playgrounds that produced zero merged PRs”, which is exactly the kind of finding that pays for the whole FinOps program in week one.
- Per‑PR view. Every merged PR has a cost attribution (total tokens, total dollars, breakdown by model) posted as a comment on the PR itself or visible in the PR list. A PR labelled `ai-authored` shows what it actually cost to produce. Expensive outliers (the $300 PR) get flagged automatically for retrospective review.
- Model tagging. Every API call carries enough metadata to attribute it back to a developer, a repo, a feature flag, or a ticket. The minimum tag set is `developer_id`, `repo_id`, `pr_number`, `model`, `purpose` (refactor / new feature / review / chat / exploration). When you cannot tag at the API layer (because the vendor controls the call), you reconcile from the vendor’s admin API by user email + timestamp + nearest commit.
- Alerts and hard caps. Spend alerts fire at multiple thresholds: per‑developer daily ceiling, per‑repo weekly ceiling, organisation monthly ceiling. Soft alerts notify; hard caps automatically rotate API keys, downgrade default models, or freeze new sessions. Engineering leadership receives a Friday digest with the top 10 cost outliers (PRs, repos, developers) as a sanity check.
- Cost‑per‑feature linkage. Every Linear/Jira ticket can be cross‑referenced with the AI cost incurred while it was open. When the head of product asks “what did the checkout v2 epic actually cost us in AI?”, you can give a number, not a shrug.
- One number on the wall. The headline metric — typically cost per merged PR — is updated weekly, visible to every engineer in a Slack/Teams channel, and trended over time. When it spikes, someone investigates; when it falls, you know the model‑routing change or caching tweak paid off.
The shape of “max score” is not about the specific tool you use. It is about whether you can answer the per‑dev, per‑repo, per‑PR question on demand with confidence — and whether your alerts fire before, not after, you cross a budget line.
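As a minimal sketch of what the minimum tag set buys you, the snippet below models one tagged call and rolls spend up by any single tag. The `UsageEvent` class, the `rollup` helper, and the sample values are hypothetical names chosen for illustration; they are not part of any vendor API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UsageEvent:
    """One tagged API call; fields mirror the minimum tag set (hypothetical schema)."""
    developer_id: str
    repo_id: str
    pr_number: Optional[int]  # None when the call is not tied to a PR
    model: str
    purpose: str  # refactor / new feature / review / chat / exploration
    cost_usd: float


def rollup(events, dimension):
    """Sum cost_usd grouped by one tag, e.g. 'developer_id' or 'pr_number'."""
    totals = defaultdict(float)
    for event in events:
        totals[getattr(event, dimension)] += event.cost_usd
    return dict(totals)


# Illustrative data only.
events = [
    UsageEvent("ana", "checkout", 101, "sonnet", "refactor", 3.0),
    UsageEvent("ana", "checkout", 101, "opus", "refactor", 9.0),
    UsageEvent("ben", "playground", None, "opus", "exploration", 5.0),
]

per_dev = rollup(events, "developer_id")
per_pr = rollup(events, "pr_number")
```

Because every event carries all five tags, the per‑dev, per‑repo, and per‑PR views are the same query with a different grouping key, which is what keeps them one click apart on a dashboard.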
Current landscape (web‑search‑verified)
Native dashboards (Anthropic Console, Cursor admin)
The native dashboards are the first‑class source of truth and the cheapest path to some per‑user visibility.
- Anthropic Console. The Admin Usage & Cost API (mid‑2025) plus the Enterprise Analytics API (March 2026) give per‑user, per‑day attribution with up to 90 days of history. For Claude Code, that includes commits, pull requests, and lines of code per user, which is where the per‑developer view starts. The Console “Usage” tab also breaks usage down by API key, so if you scope keys per repo or per service, you get a free per‑repo dimension. The headline gap: out of the box there is no per‑PR slicing unless you tag.
- Cursor admin dashboard. Team and Business plans expose per‑member usage with credit consumption per model. The admin console can export CSVs you can pipe into your warehouse. Cursor’s June 2025 move to usage‑based credits means every action has a credit cost the admin can see — useful, but the API is still less mature than Anthropic’s.
- OpenAI usage exports. ChatGPT Business and Codex Cloud both expose per‑user usage in the admin console, with CSV exports per month. Per‑PR attribution requires you to bridge ChatGPT user IDs to your git identities — usually via email.
- GitHub Copilot admin. Copilot Business and Enterprise expose per‑seat activity and acceptance rates, but spend per seat is functionally fixed by the plan, so the FinOps interest is lower. The more useful Copilot metric is acceptance rate per developer, not spend per developer.
Start with the native dashboards. They give you the per‑developer view in a day. They give you the per‑repo view if you scope API keys per repo. They do not give you the per‑PR view — that is where tagging comes in.
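The per‑key scoping trick can be sketched as follows, assuming you have a usage export with one row per API key. The column names `api_key_name` and `cost_usd`, the key names, and the `KEY_TO_REPO` mapping are all hypothetical; real exports will use each vendor's own column names.

```python
import csv
import io
from collections import defaultdict

# Hypothetical mapping you maintain yourself: one API key name per repository.
KEY_TO_REPO = {"key-checkout": "checkout", "key-search": "search"}


def spend_by_repo(csv_text, key_to_repo=KEY_TO_REPO):
    """Sum cost per repo from an export with api_key_name and cost_usd columns."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Keys that are not mapped to a repo are kept visible, not dropped.
        repo = key_to_repo.get(row["api_key_name"], "unattributed")
        totals[repo] += float(row["cost_usd"])
    return dict(totals)


# Illustrative export contents.
sample = """api_key_name,cost_usd
key-checkout,4.50
key-search,1.25
key-checkout,0.50
key-legacy,2.00
"""

repo_totals = spend_by_repo(sample)
```

Keeping an explicit "unattributed" bucket is a deliberate choice in this sketch: the size of that bucket tells you how much spend still escapes the per‑repo view.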
Tagging strategies (PR comment with cost, commit trailer with model)
The per‑PR view is the answer to “did this PR cost $5 or $500?”. To get it, you have to tag every AI‑driven action with enough context to attribute it back.
- Commit trailer with model. Many teams have started appending an `AI-Model:` and an `AI-Cost-Estimate:` trailer to commit messages, populated by a Claude Code or Codex hook. The trailer makes the cost attribution survive squash‑merge and shows up in `git log`. This is the lightest possible tagging step and is reversible if you change your mind about format.
- PR comment with cost. A GitHub Action runs on PR open/synchronize. It pulls the AI tool’s usage for the PR’s time window (matched by branch name → developer → timestamps), totals the cost, and posts it as a comment: “This PR consumed 432K input tokens, 14K output tokens, $4.21 across Sonnet 4.6 (88%) and Opus 4.7 (12%).” The comment is visible to reviewers, and the bot edits in place on each push so the running total stays current.
- PR labels for tiering. Auto‑label PRs by cost band: `cost:low` (under $2), `cost:medium` ($2–$20), `cost:high` ($20+). Reviewers know without checking what they are looking at. Tie the `cost:high` label to extra review (see Q11 · AI‑PR labelling).
- Branch‑name conventions. A naming convention like `<dev>/<repo>/<ticket>` makes it trivial to bucket sessions back to dev + repo + ticket without API support. Many vendor admin APIs return the branch name in their per‑request logs, which is enough to reconstruct the attribution post hoc.
- Tool selection annotation. A short pre‑prompt or hook records “this session is a refactor / new feature / code review / exploration”. You will be amazed how much of your spend turns out to be exploration, and how much exploration produces no commits.
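The trailer, label, and comment conventions above can be sketched in a few small functions. This is an illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the trailer values are treated as free text, and a PR costing exactly $20 is assigned to `cost:high` because the bands in the text leave that boundary open.

```python
import re

# Matches the two trailers named in the text, one per line.
TRAILER_RE = re.compile(r"^(AI-Model|AI-Cost-Estimate):\s*(.+)$", re.MULTILINE)


def parse_ai_trailers(commit_message):
    """Return a dict of the AI trailers that are present in a commit message."""
    return {key: value.strip() for key, value in TRAILER_RE.findall(commit_message)}


def cost_band(cost_usd):
    """Map a PR's dollar cost to a label band (boundary at $20 is an assumption)."""
    if cost_usd < 2:
        return "cost:low"
    if cost_usd < 20:
        return "cost:medium"
    return "cost:high"


def format_pr_comment(input_tokens, output_tokens, cost_usd):
    """Render the running-total comment a bot would post or edit in place."""
    return (
        f"This PR consumed {input_tokens // 1000}K input tokens, "
        f"{output_tokens // 1000}K output tokens, ${cost_usd:.2f}."
    )


# Illustrative commit message.
message = "Refactor cart totals\n\nAI-Model: claude-sonnet-4.6\nAI-Cost-Estimate: $4.21\n"

trailers = parse_ai_trailers(message)
label = cost_band(4.21)
comment = format_pr_comment(432_000, 14_000, 4.21)
```

The same `parse_ai_trailers` check can run in a commit hook or CI job that rejects commits with no `AI-Model:` trailer, which is one way to make the convention hard to skip.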
Aggregators (Helicone, Langfuse, OpenRouter usage)
When you need a unified view across multiple vendors, aggregators sit between your engineers and the LLM APIs as a reverse proxy or SDK wrapper, capturing every request with metadata.
- Helicone. Open‑source observability layer for LLM calls. Sits on the request path or as a sidecar. Captures token counts, cost, model, latency, and arbitrary properties (user, repo, PR). Built‑in dashboards for spend by user, model, and custom property. Strong default for “I want one screen with every call from every dev across every model” without writing a warehouse pipeline.
- Langfuse. Open‑source tracing and prompt‑engineering platform with strong cost analytics. Adds trace‑level granularity — useful when an agentic session has dozens of internal LLM calls that should attribute to a single PR. Better than Helicone if you also want prompt versioning, evals, and offline experiments. Worse if you only need spend dashboards.
- OpenRouter usage. If you route through OpenRouter (a common pattern for multi‑model abstraction), it exposes per‑API‑key, per‑model, per‑request usage with metadata tags. Pair an OpenRouter key per developer or per repo with their `metadata` field for free per‑PR tagging, with no proxy infra to operate.
- Vendor‑agnostic gateways. Internal LLM gateways (Portkey, LiteLLM, custom) give you full control over tagging and rate limits but add an operational dependency. Pick this when you genuinely run >3 LLM vendors and need policy enforcement at the gateway; otherwise start with Helicone or OpenRouter.
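For calls you make yourself through a proxy‑style aggregator, tagging usually means attaching metadata to each request. The sketch below builds request headers in the style of Helicone's custom‑property convention (`Helicone-Auth`, `Helicone-User-Id`, `Helicone-Property-<Name>`). Treat the header names as an assumption to verify against the current Helicone documentation; the function name and property names are hypothetical.

```python
def helicone_tag_headers(helicone_api_key, developer_id, properties):
    """Build per-request tagging headers (header names are assumed, verify in docs)."""
    headers = {
        "Helicone-Auth": f"Bearer {helicone_api_key}",
        "Helicone-User-Id": developer_id,
    }
    # One header per custom property, e.g. repo and PR number.
    for name, value in properties.items():
        headers[f"Helicone-Property-{name}"] = str(value)
    return headers


# Illustrative values only; the key is a placeholder, not a real credential.
headers = helicone_tag_headers("hk-test", "ana", {"Repo": "checkout", "PR": 101})
```

These headers would be merged into whatever HTTP client or SDK your custom agent already uses, so the tagging lives in one place instead of in each engineer's memory.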
Alerts and hard caps
Visibility without alerts is just a wall of numbers no one reads. The alert system should be three‑tiered:
- Soft per‑developer alert. Triggered when a developer’s daily spend exceeds 2× their trailing 30‑day median. Posts in a private DM, not a public channel. Often a false positive (someone is doing a hard refactor that day) but useful as a heads‑up.
- Per‑repo / per‑PR alert. A single PR exceeding a threshold ($25 is a common starting line) posts in the team channel with a link. Reviewers know to look harder. Sometimes the PR is legitimately a huge migration; sometimes it is an agent that looped on a flaky test for 4 hours.
- Org‑level hard cap. Daily and monthly spend ceilings, set well below the budget. At 80% of monthly budget, ops gets a warning. At 100%, default models downgrade automatically (Sonnet → Haiku, Opus → Sonnet) until budget resets. Hard cap is uncomfortable; it is also the only thing that stops a stuck background agent from burning a month’s budget overnight.
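The three tiers can be expressed as three small checks. This is a minimal sketch using the thresholds quoted above (2× trailing median, $25 per PR, 80% and 100% of monthly budget); the function names and the returned action strings are hypothetical, and delivery to Slack or Teams is left out.

```python
from statistics import median


def soft_dev_alert(today_usd, trailing_daily_usd, multiplier=2.0):
    """True when today's spend exceeds multiplier x the trailing daily median."""
    if not trailing_daily_usd:
        return False  # no baseline yet, so do not alert
    return today_usd > multiplier * median(trailing_daily_usd)


def pr_alert(pr_cost_usd, threshold_usd=25.0):
    """True when a single PR crosses the per-PR threshold."""
    return pr_cost_usd > threshold_usd


def org_cap_action(month_to_date_usd, monthly_budget_usd):
    """Return the org-level action for the current share of budget used."""
    used = month_to_date_usd / monthly_budget_usd
    if used >= 1.0:
        return "downgrade_models"
    if used >= 0.8:
        return "warn_ops"
    return "ok"


# Illustrative checks.
dev_flag = soft_dev_alert(25.0, [10.0, 12.0, 8.0])  # median is 10, so 25 > 20
pr_flag = pr_alert(30.0)
org_action = org_cap_action(8_500, 10_000)
```

Using the median as the baseline, as the text suggests, keeps one unusually expensive day from inflating the threshold for the rest of the month.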
Cost‑per‑feature metric (link to specific PRs / tickets)
The final layer turns spend into a unit‑economic metric. Most teams settle on one of:
- Cost per merged PR. Easiest to compute, hardest to game. Trended weekly, broken down by repo. Surfaces both inefficiency (low‑value repo, high cost) and good caching (refactor sweep dropped the average by 30%).
- Cost per shipped ticket. Group all PRs linked to a Linear/Jira ticket. Sum spend. Compare to the ticket’s “value” (often a T‑shirt size or business‑value tag). Catches the case where one ticket fragmented into 12 small PRs that together cost $400.
- Cost per accepted suggestion (Copilot‑style). Less useful for agentic tools where each “session” produces many edits, but still relevant for Copilot Workspace.
Pick one as the headline, the others as supporting. The headline metric goes on a wall. The supporting metrics get inspected when the headline moves.
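A minimal sketch of the first two metrics, assuming each PR row already carries its attributed cost and an optional ticket key. The row shape (`number`, `ticket`, `merged`, `cost_usd`), the ticket keys, and the function names are hypothetical; here cost per merged PR is defined as the mean cost of merged PRs only.

```python
from collections import defaultdict


def cost_per_merged_pr(prs):
    """Mean cost of merged PRs, or None when nothing has merged."""
    merged = [pr["cost_usd"] for pr in prs if pr["merged"]]
    if not merged:
        return None
    return sum(merged) / len(merged)


def cost_per_ticket(prs):
    """Sum cost across every PR linked to the same ticket, merged or not."""
    totals = defaultdict(float)
    for pr in prs:
        if pr["ticket"] is not None:
            totals[pr["ticket"]] += pr["cost_usd"]
    return dict(totals)


# Illustrative rows only.
prs = [
    {"number": 1, "ticket": "CHK-7", "merged": True, "cost_usd": 4.0},
    {"number": 2, "ticket": "CHK-7", "merged": True, "cost_usd": 6.0},
    {"number": 3, "ticket": "SRCH-2", "merged": False, "cost_usd": 30.0},
    {"number": 4, "ticket": None, "merged": True, "cost_usd": 2.0},
]

headline = cost_per_merged_pr(prs)
by_ticket = cost_per_ticket(prs)
```

An alternative definition divides total spend, including abandoned PRs, by the merged count; that version makes wasted exploration visible in the headline number, at the price of being noisier week to week.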
Step‑by‑step: building cost visibility
1. Inventory every vendor that bills tokens. List every paid AI tool the org touches: Anthropic, OpenAI, Cursor, Copilot, Replit, Windsurf, Hugging Face, Bedrock, Vertex. For each, note the billing entity, the monthly amount, the admin console URL, and who has admin access. Most orgs find one or two vendors they had forgotten; a stranded Replit team account is the canonical example.
2. Pull the last 90 days of per‑user data from each vendor’s admin API. Anthropic Admin Usage & Cost API, Cursor team export, OpenAI usage CSV, Copilot admin export. Drop it all into one place: a Google Sheet is fine for week one, a BigQuery / Snowflake table is right for production. The act of getting all this in one schema is half the work; do it before you pick a dashboard tool.
3. Build the per‑developer view first. Aggregate by user email, last 30 days, broken down by vendor and by model. Sort descending. Look at the top 10% and the bottom 10%. The top will usually be 5–10× the median; the bottom may be zero (people who are paying for a seat but not using it). Both findings are actionable on day one.
4. Layer in the per‑repo view via API‑key scoping. Issue distinct API keys (or distinct Cursor team slugs, or distinct Codex projects) per repository, at least for the top 10 repos by activity. From that day forward, the vendor’s per‑key usage is your per‑repo usage. No new infra. If a vendor does not support multiple keys per team, scope by user → repo via the PR/commit reconciliation in step 5.
5. Tag PRs with cost at merge time. Add a GitHub Action that on every PR merge: (a) reads the PR’s branch + author + time window, (b) queries each vendor’s admin API for spend matching that window, (c) posts a comment with the totals, (d) writes a row to a `pr_costs` table in your warehouse. Existing open‑source implementations: search for “claude‑code‑pr‑cost‑bot” and similar. Most teams write their own in a Saturday afternoon.
6. Adopt a tagging convention for sessions. Decide on one branch‑name pattern (e.g. `<dev>/<area>/<short-desc>`) and one commit trailer (e.g. `AI-Model: claude-sonnet-4.6`). Document it in CLAUDE.md / AGENTS.md / repo READMEs. Enforce via a hook so engineers cannot accidentally skip it. This is the single highest‑leverage step for per‑PR attribution and costs almost nothing.
7. Wire alerts to Slack/Teams. Three channels of alert as described above: soft per‑dev (DM), per‑PR (team channel), org cap (ops channel). Use the vendor‑native budget alerts where they exist (Anthropic Admin, OpenAI org budgets) and a scheduled SQL query for the rest. Calibrate thresholds for two weeks, expecting false positives, then tighten.
8. Stand up a dashboard view. Metabase, Hex, Looker, Grafana, or even a single Notion page: it does not matter much. What matters is that the per‑dev, per‑repo, per‑PR slices are one click apart and refreshed daily. Add a headline tile for cost per merged PR and trailing‑7 spend by repo. Make it visible to every engineer, not just leadership.
9. Install at least one aggregator. Pick Helicone or Langfuse, route at minimum your custom agent calls through it. You do not have to route Claude Code or Cursor through a proxy, since those vendors run the inference, but anything you call directly (a custom agent, an evals pipeline, a vendor‑agnostic gateway) absolutely should be observed. Two weeks of data here will show you patterns the vendor dashboards cannot.
10. Set hard caps before you need them. Configure budget caps at the vendor level (Anthropic monthly budget, OpenAI org limit, Cursor team cap). Set them at 1.25× your current monthly spend so you have headroom but cannot accidentally 10× overnight. A stuck background agent can spend more in 12 hours than a senior engineer earns in a month; the hard cap is the last line of defence.
11. Calendar a monthly FinOps review. Engineering leadership, finance, and one or two senior engineers spend 30 minutes per month on the dashboard. Look at the top spenders, the top PRs, the cost‑per‑merged‑PR trend, and any alerts that fired. Decide one action item: a model‑routing change, a caching opportunity, a deprecated tool to remove. Action items are short; the discipline of doing them every month compounds.
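The merge‑time reconciliation (author plus time window) is the least obvious part of the list above, so here is a minimal sketch of it. It assumes you have already pulled usage rows with an email, a timestamp, and a dollar cost; the row shape, the email addresses, and the `attribute_to_pr` name are hypothetical.

```python
from datetime import datetime


def attribute_to_pr(usage_rows, author_email, opened_at, merged_at):
    """Sum usage by the PR author that falls inside the PR's open-to-merge window."""
    total = 0.0
    for row in usage_rows:
        in_window = opened_at <= row["at"] <= merged_at
        if row["email"] == author_email and in_window:
            total += row["cost_usd"]
    return total


# Illustrative usage rows, as they might look after normalising vendor exports.
usage_rows = [
    {"email": "ana@example.com", "at": datetime(2026, 3, 2, 10, 0), "cost_usd": 1.5},
    {"email": "ana@example.com", "at": datetime(2026, 3, 3, 9, 0), "cost_usd": 2.5},
    {"email": "ana@example.com", "at": datetime(2026, 3, 9, 9, 0), "cost_usd": 7.0},
    {"email": "ben@example.com", "at": datetime(2026, 3, 2, 11, 0), "cost_usd": 4.0},
]

pr_cost = attribute_to_pr(
    usage_rows,
    "ana@example.com",
    opened_at=datetime(2026, 3, 2, 0, 0),
    merged_at=datetime(2026, 3, 4, 0, 0),
)
```

Window matching double‑counts when one author has two PRs open at the same time. That is the main reason to prefer branch names or commit trailers for attribution wherever the vendor logs expose them, and to fall back to time windows only when they do not.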
Common pitfalls
- Only invoice‑level view. You can name the vendor and the total, but not the developer, repo, or PR. This is where 80% of orgs are. It is the failure that makes every cost conversation a guess. Steps 2–4 above fix it in a week.
- No model tagging. Every call is “AI spend” with no per‑model breakdown. You cannot tell whether the Opus calls are 90% of cost (likely) or 30% of cost. You cannot run a model‑routing experiment because you cannot measure its impact. Cheap to fix; expensive to ignore.
- No alerts, only retrospective dashboards. Dashboards are reviewed once a quarter. Alerts are the only way runaway agents get caught the same day. If you do not have a Slack channel where AI cost alerts fire, you do not have cost visibility — you have cost archaeology.
- Per‑seat thinking instead of per‑outcome thinking. Spend per seat looks fixed (Cursor Pro = $20). Spend per outcome (cost per merged PR) is the metric that matters. Teams that fixate on seat licences miss the 10× variance hiding inside agentic usage.
- Tagging that engineers can skip. If your tagging convention relies on engineers remembering to set an env var or prefix a commit, it will drift in a month. Enforce via hook, CI check, or proxy layer. Drift is the silent killer of FinOps programs.
- Building a warehouse you cannot maintain. A perfect Snowflake schema that nobody owns will rot. A scrappy Google Sheet that gets refreshed every Monday will give you 80% of the value. Start scrappy; promote to warehouse only when the sheet hurts.
- Ignoring prompt caching impact. Anthropic’s prompt caching can drop input‑token cost by up to 90%. If your cost‑per‑PR is not falling as caching adoption rises, your dashboard is wrong or your engineers are not using caching — both worth a Tuesday afternoon.
- Confusing cost control with cost visibility. Visibility means you can answer the questions. Control means you act on the answers. Q4 is about visibility — but visibility without action is theatre. Make sure at least one action ships per monthly review.
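To sanity‑check the caching pitfall, it helps to know what cost‑per‑PR should do as the cache hit rate rises. The sketch below assumes cached input tokens are billed at 10% of the base input price (consistent with the "up to 90%" figure above) and ignores any surcharge for writing to the cache. The $3 per million tokens price is purely illustrative, and the function name is hypothetical.

```python
def input_cost_usd(input_tokens, price_per_mtok, cache_hit_rate, cached_read_multiplier=0.1):
    """Input-token cost when a share of tokens is served from the prompt cache."""
    uncached = input_tokens * (1 - cache_hit_rate)
    cached = input_tokens * cache_hit_rate
    # Cached reads are billed at a fraction of the base price (assumed 10%).
    billable = uncached + cached * cached_read_multiplier
    return billable * price_per_mtok / 1_000_000


# One million input tokens at an illustrative $3 per million tokens.
no_cache = input_cost_usd(1_000_000, 3.0, cache_hit_rate=0.0)
half_cached = input_cost_usd(1_000_000, 3.0, cache_hit_rate=0.5)
fully_cached = input_cost_usd(1_000_000, 3.0, cache_hit_rate=1.0)
```

Under these assumptions a 50% hit rate cuts input cost by 45%, not 50%, and the full 90% saving only appears when every input token is a cache read. If the dashboard shows caching adoption rising with no movement in input cost, one of the two numbers is being measured incorrectly.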
How to verify you’re there
- You can answer “which 10% of engineers drive 50% of spend” in under five minutes on demand.
- You can answer “what did the checkout v2 epic cost in tokens” with a real number, not a range.
- Every merged PR has either a cost comment or a row in a queryable table; the most expensive PR last week is identifiable by name.
- A stuck background agent yesterday would have hit a per‑dev or per‑repo alert before $50 of spend, and a hard cap before $500.
- Your monthly FinOps review has a written list of action items from the previous review, and most of them have shipped.
- Cost per merged PR is visible to every engineer, trended over time, and has moved in the right direction at least once in the last quarter as a result of a deliberate change.
- Finance does not have to chase engineering for cost allocation — engineering proactively shares the breakdown each month.
- If your AI vendor disappeared tomorrow, you would know — to the dollar — what it had been costing per team, per repo, and per PR.
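The first check on this list reduces to one small calculation once per‑developer totals exist. A minimal sketch, assuming a mapping from developer to trailing spend; the `share_of_spenders_for` name and the sample figures are hypothetical.

```python
def share_of_spenders_for(spend_by_dev, target_share=0.5):
    """Smallest fraction of developers whose combined spend reaches target_share."""
    total = sum(spend_by_dev.values())
    if total <= 0:
        return None  # nothing to analyse
    ranked = sorted(spend_by_dev.values(), reverse=True)
    running = 0.0
    for count, amount in enumerate(ranked, start=1):
        running += amount
        if running >= target_share * total:
            return count / len(ranked)
    return 1.0


# Illustrative data: one heavy user and nine light users.
spend_by_dev = {"dev0": 600.0}
spend_by_dev.update({f"dev{i}": 50.0 for i in range(1, 10)})

concentration = share_of_spenders_for(spend_by_dev)
```

In this sample one developer out of ten accounts for more than half the spend, so the function returns 0.1. Whether that developer is the team's highest‑throughput engineer or a stuck agent is the question the per‑PR view then answers.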