AI metrics panel — 6 numbers every CTO should track

Q22 · Strategy & ROI What do you measure for AI tooling? (multi-select — 1 pt each, max 3)

Max-score answer (3 pts): Spend ($/dev/month), throughput (PR/dev/week), quality (bug regression rate, revert rate), adoption (active-user %, average session count), review-to-merge time, and cost-per-feature (token cost per feature ticket), all on a single panel.

Why it matters: You can’t improve what you don’t measure. The set above is the 2026 minimum panel — most CTOs track 1–2 and miss the leverage in the rest.

Why this matters in 2026

The 2025 narrative — “AI tooling makes developers faster, ship more, here’s the seat count, here’s the bill” — is over. Through 2026 the question every board, finance team, and CFO is asking has tightened: which of those statements is actually true on this engineering org, in numbers, this quarter? The default answer most CTOs can give — “throughput is up, vibes are good, here’s the trailing 30-day spend” — does not survive contact with a serious ROI review. It also does not survive contact with the operational reality of the tools themselves: AI-co-authored PRs ship roughly 1.7× more issues than human-only PRs, agentic sessions burn 3–10× more tokens than chatbot sessions, and the variance between your best and worst engineer on AI tooling is now larger than the variance between your best and worst engineer overall.

The single most measurable industry shift between 2024 and 2026 is that teams that measure win, and teams that don’t measure get out-shipped by teams half their size. Larridin’s 2026 Developer Productivity Benchmarks report puts elite AI-native teams at 80%+ weekly active usage, 60–75% AI-assisted code share, and sub-8-hour PR cycle times — numbers that are not opinions, they are dashboards. Exceeds.ai’s 2026 AI Development Benchmarking guide for CTOs frames the same idea differently: “A good benchmark in 2026 measures at least three of five dimensions: adoption, AI code share, complexity-adjusted velocity, code quality, and ROI.” When AI adoption rises from 0% to 100% on a team, 2026 productivity analyses (Larridin’s Developer Productivity Benchmarks 2026 among them) report median cycle time dropping roughly 24% (from ~16.7 to ~12.7 hours) and average PRs per engineer climbing around 113% — while the bug rate rises about 9%. Treat the exact figures as directional rather than a single authoritative dataset; the point is that you cannot decide which side of that trade your team is on without instrumentation. You are flying on hope.

The 2026 minimum panel is six numbers, on one screen, refreshed at least weekly: spend per developer per month, throughput (PR/dev/week), quality (bug regression rate, revert rate), adoption (active-user %, average session count), review-to-merge time, and cost-per-feature (token cost per feature ticket). Most CTOs in 2026 measure one or two of these — typically spend and seat-level adoption — and miss the leverage in the rest. The CTO Scorecard caps Q22 at 3 points precisely because each metric independently moves a real decision, and the panel only makes sense when you can see all six trends at once: when spend rises but throughput doesn’t, when adoption is flat but quality slipped, when review-to-merge time looks great but cost-per-feature blew out because everyone is using Opus on everything.

What “max score” actually looks like (all 6 metrics on a single dashboard)

A max-score Q22 dashboard is one page. Six tiles, six trends, one row of headline numbers across the top. Engineering leadership opens it on Monday morning, and within 90 seconds the room knows whether last week was a good week. Each tile drills down. Each metric has an owner.

Spend ($/dev/month). Trailing 30-day token spend per active developer, broken down by vendor (Anthropic, OpenAI, Cursor, Copilot) and by model (Opus, Sonnet, Haiku, GPT-5, Gemini). A median, a 90th percentile, and a top-five list. The number you put on the wall is the median per active developer per month — outliers are interesting, but the median is what you steer by. Calibrate to the 2026 range: the published industry data on power users sits at $50–$200/dev/month for Claude Code and Cursor combined; if your median is $10 you have an adoption problem, if it is $400 you have a routing problem.
Throughput (PR/dev/week). Merged PRs per active developer per week, with a 4-week rolling average. Critically, you also split this by AI-authored vs human-only (using the ai-authored label from Q11). Pure PR count is a vanity metric; PR count split by authorship is the leading indicator of whether AI is actually shipping work or just generating noise. Pair it with a complexity-adjusted view (see “gotchas” below) so a refactor PR doesn’t count the same as a typo fix.
Quality (bug regression rate, revert rate). Two numbers. Bug regression rate = bugs filed within 14 days of merge / PRs merged. Revert rate = reverts within 14 days of merge / PRs merged. Both trended weekly, both split AI-authored vs human-only. The published 2026 figure is a roughly 9% higher bug rate on AI-authored code; you want your own number at or below that industry baseline, not above.
Adoption (active-user %, average session count). Weekly active users as a % of seats licensed, plus average sessions per active user per week. The first number tells you whether you are paying for shelfware; the second tells you whether the people who do use the tools are using them as a daily habit or a once-a-week curiosity. Elite teams sit at 80%+ WAU and 8–15 sessions per user per week.
Review-to-merge time. Median time from PR opened to PR merged, in hours. Same split — AI-authored vs human-only. This is the cleanest single proxy for “is the AI actually accelerating shipping, or just generating PRs that pile up in review?” Elite teams sit under 8 hours; struggling teams sit at 30–48 hours and don’t know why.
Cost-per-feature (token cost per feature ticket). The most ambitious of the six and the one that pulls the whole panel together. Every Linear/Jira feature ticket gets a total AI cost — sum of token spend across every PR linked to that ticket. Trended over time, sliced by team and by feature type. This is the number that answers the head-of-product question “what did the checkout v2 epic cost us in AI?” with a real answer, not a shrug.

On top of the six tiles, the panel carries two contextual labels. Baseline: the same six numbers from 90 days ago, so every trend is visible at a glance. Owner: the named human who is on the hook for moving the number this quarter. A dashboard without baselines is decoration. A dashboard without owners is wallpaper.
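The tile-plus-labels structure described above can be sketched as a small data type. This is a minimal illustration, not a prescribed schema: the `Tile` class, its field names, and the `headline_row` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """One dashboard tile: a headline number plus its two contextual labels."""
    name: str
    today: float
    baseline_90d: float           # the same metric 90 days ago
    owner: str                    # named human accountable for the number
    lower_is_better: bool = False  # e.g. review-to-merge hours, revert rate

    def delta_pct(self) -> float:
        """Percent change versus the 90-day baseline."""
        if self.baseline_90d == 0:
            raise ValueError(f"{self.name}: baseline is zero, delta undefined")
        return (self.today - self.baseline_90d) / self.baseline_90d * 100.0

    def improving(self) -> bool:
        """True when the number moved in the desired direction."""
        change = self.today - self.baseline_90d
        return change < 0 if self.lower_is_better else change > 0


def headline_row(tiles):
    """Render the row of headline numbers shown across the top of the page."""
    return [
        f"{t.name}: {t.today:g} ({t.delta_pct():+.1f}% vs 90d, owner: {t.owner})"
        for t in tiles
    ]


# Example values are made up for illustration.
review_tile = Tile("review_to_merge_hours", 12.0, 16.0,
                   owner="Hypothetical Owner", lower_is_better=True)
row = headline_row([review_tile])
```

Keeping the baseline and owner on the tile itself, rather than in a separate table, is one way to make sure neither can be dropped when the chart is copied into a slide.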

Current landscape (web-search-verified)

Spend ($/dev/month) — where to pull from

Spend per developer per month is the foundational number. It is also the easiest to mis-measure because most CTOs accept the vendor invoice as the answer and stop there.

Anthropic Console Admin Usage & Cost API (mid-2025) plus the Enterprise Analytics API (2026) give per-user, per-day, per-model spend with up to 90 days of history. For Claude Code subscribers, the API also returns commits, pull requests, and lines of code per user — which is half of your throughput dashboard for free.
Cursor admin dashboard exposes per-member credit consumption per model, with CSV/REST export for Team and Business plans. Cursor’s 2025 move to usage-based credits means every action carries a credit cost the admin can see; the credit-to-dollar conversion is on the billing page.
OpenAI usage exports are per-user via project-scoped API keys (the post-2026 model). The admin console exports CSVs; cross-reference user IDs against your git identities by email.
GitHub Copilot admin shows per-seat activity and acceptance rates, but spend per seat is functionally fixed by the plan, so the FinOps interest is in acceptance rate per dev more than spend per dev.

Spend goes on the dashboard as median $/active-dev/month with top decile and bottom decile annotations. Both extremes are signal — top spenders often correlate with top throughput, bottom spenders are usually paying for an unused seat. Q4 (Cost visibility) is the deeper version of this metric; Q22 just demands the headline.
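The headline described above (a median with top-decile and bottom-decile annotations) takes only a few lines once per-developer spend is in one place. The sketch below assumes you have already reduced vendor exports to a mapping of developer to trailing-30-day dollars; `spend_summary` and `calibrate` are illustrative names, and the $50 and $200 cut-offs simply reuse the range quoted earlier in this article.

```python
import statistics


def spend_summary(spend_by_dev):
    """Median plus bottom/top decile cut points for $/active-dev/month.

    spend_by_dev: dict mapping developer id -> trailing-30-day spend in USD.
    """
    values = sorted(spend_by_dev.values())
    if len(values) < 2:
        raise ValueError("need at least two developers to compute deciles")
    # quantiles(n=10) returns the 9 cut points between deciles.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "median": statistics.median(values),
        "p10": deciles[0],   # bottom-decile annotation
        "p90": deciles[8],   # top-decile annotation
    }


def calibrate(median_usd, low=50.0, high=200.0):
    """Compare the median with the $50-$200 range quoted in the article.

    The hard cut-offs are an assumption made for this sketch.
    """
    if median_usd < low:
        return "below range"   # possible adoption problem
    if median_usd > high:
        return "above range"   # possible routing problem
    return "within range"


# Made-up data: eleven developers spending $0, $10, ..., $100.
example = {f"dev{i}": i * 10 for i in range(11)}
summary = spend_summary(example)
```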

Throughput (PR/dev/week) — gotchas around AI authorship

Throughput looks simple (count merged PRs, divide by active developers, divide by weeks), yet it is the most commonly mis-measured metric on the panel.

Vanity-PR risk. Without an authorship split, a team can look like it doubled throughput by generating low-value PRs that nobody would have merged on a careful read. The split between AI-authored and human-only is the discipline that prevents this. Use the ai-authored label from Q11 (or commit trailers) to bucket every PR.
Complexity-adjusted throughput (CAT). Industry benchmarks in 2026 increasingly use a complexity-weighted PR count: Easy = 1 point, Medium = 3, Hard = 8. Larridin and Exceeds.ai both publish using CAT. You can approximate CAT cheaply with diff size and file count (Easy = <50 lines / 1–2 files, Medium = 50–300 lines / 3–8 files, Hard = >300 lines / 9+ files), or rigorously with manual tagging on a sample. Either is better than raw PR count.
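Active-developer denominator. A developer who merged zero PRs this week is still in the denominator as long as they meet your active-developer definition (for example, at least one merged PR in the trailing 14 days). Don’t drop them for one quiet week; that inflates the per-developer number.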
Reverts and superseded PRs. Subtract these from the numerator. A PR that gets reverted within 14 days didn’t ship value; counting it inflates the headline.

When you see published numbers like “average PRs per engineer climbed 113% as AI adoption rose from 0% to 100%”, they are almost certainly raw PR count. The honest number is closer to 30–60% on complexity-adjusted throughput — still significant, but not the headline figure.
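The cheap CAT approximation described above (diff size and file count mapped to 1, 3, or 8 points) might look like the following. The thresholds come from the text; how to resolve a PR whose line count and file count land in different buckets is not specified there, so taking the harder of the two is an assumption of this sketch, as are the function names.

```python
POINTS = {"easy": 1, "medium": 3, "hard": 8}
_ORDER = ["easy", "medium", "hard"]


def size_bucket(lines_changed, files_changed):
    """Classify a PR as easy / medium / hard from diff size and file count."""
    if lines_changed < 50:
        by_lines = "easy"
    elif lines_changed <= 300:
        by_lines = "medium"
    else:
        by_lines = "hard"

    if files_changed <= 2:
        by_files = "easy"
    elif files_changed <= 8:
        by_files = "medium"
    else:
        by_files = "hard"

    # Assumption: when the two signals disagree, use the harder bucket.
    return max(by_lines, by_files, key=_ORDER.index)


def cat_points(prs):
    """Sum complexity points over (lines_changed, files_changed) pairs."""
    return sum(POINTS[size_bucket(lines, files)] for lines, files in prs)


# Three made-up PRs: one easy, one medium, one hard.
week = [(10, 1), (120, 4), (500, 10)]
total = cat_points(week)
```

Raw PR count for that week would be 3; the complexity-weighted count is 12, which is the number to divide by active developers.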

Quality (bug regression rate, revert rate)

Quality is the metric that prevents throughput from becoming a lie. Two numbers, weekly:

Bug regression rate. Bugs filed within 14 days of merge / PRs merged in that period. The 14-day window is a published industry convention — long enough to catch the bugs caused by the change, short enough to attribute them to it.
Revert rate. PRs reverted within 14 days / PRs merged. A clean signal because reverts are unambiguous.

Both split by authorship. The 2026 published baseline for AI-authored code is about 9% higher bug rate and a similar elevation on reverts, with security-shaped issues running materially higher: Veracode’s 2025 GenAI Code Security Report found AI-generated code failed security checks in roughly 45% of tasks (XSS/CWE-80 in about 86% of cases), and CodeRabbit’s December 2025 State of AI vs Human Code Generation put cross-site-scripting findings at roughly 2.74× human-authored. The number you want on your dashboard is whether your team sits at, below, or above those industry baselines — and which direction the trend is moving as you tighten review gates (Q9, Q10, Q11).
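The two quality ratios, split by authorship, can be sketched as below. The `MergedPR` record and its field names are hypothetical; in practice the bug and revert dates would come from joining your tracker and VCS. The sketch counts bugs (not PRs with bugs) in the numerator, which is one reading of the definition above.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional

WINDOW = timedelta(days=14)


@dataclass
class MergedPR:
    merged_on: date
    ai_authored: bool
    bugs_filed_on: List[date] = field(default_factory=list)
    reverted_on: Optional[date] = None


def _within_window(event: date, merged_on: date) -> bool:
    return timedelta(0) <= event - merged_on <= WINDOW


def quality_rates(prs):
    """Bug regression rate and revert rate per authorship bucket."""
    out = {}
    for label, flag in (("ai-authored", True), ("human-only", False)):
        bucket = [p for p in prs if p.ai_authored is flag]
        if not bucket:
            out[label] = None  # no PRs merged in this bucket
            continue
        bugs = sum(
            1 for p in bucket for d in p.bugs_filed_on
            if _within_window(d, p.merged_on)
        )
        reverts = sum(
            1 for p in bucket
            if p.reverted_on is not None
            and _within_window(p.reverted_on, p.merged_on)
        )
        out[label] = {
            "bug_regression_rate": bugs / len(bucket),
            "revert_rate": reverts / len(bucket),
        }
    return out


merged = date(2026, 1, 5)
sample = [
    # Bug on Jan 10 counts; bug on Feb 1 is outside the 14-day window.
    MergedPR(merged, True, bugs_filed_on=[date(2026, 1, 10), date(2026, 2, 1)]),
    MergedPR(merged, True, reverted_on=date(2026, 1, 8)),
    MergedPR(merged, False),
]
rates = quality_rates(sample)
```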

Adoption (active-user %, session count)

Adoption is the first metric to drift if no one is watching, because seats look “deployed” the moment they are licensed.

Weekly active users as % of seats licensed. Pulled from the vendor admin API. Anthropic, Cursor, OpenAI, and Copilot all expose per-seat activity. Elite teams sit at 80%+ WAU; the median 2026 team sits at 50–65%; teams below 40% are paying for shelfware and don’t know it.
Average sessions per active user per week. A “session” is vendor-defined but the proxy is good enough. Daily habit looks like 8–15 sessions/week; once-a-week curiosity looks like 1–2. The gap between those two regimes is where the actual productivity gains live.
Power-user concentration. A useful supporting metric: what % of total usage comes from the top 20% of users? Healthy teams sit at 40–60% (the Pareto pattern); unhealthy teams sit at 80–90% (one champion is doing all the work and the rest are quiet).

Cross-reference adoption with Q1 (Team adoption rate). Q22 just needs the measurement live on a dashboard.
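The three adoption numbers above reduce to simple arithmetic once you have a per-user session count for the week. The sketch below makes two assumptions that the text leaves open: "usage" for the concentration metric is measured in sessions, and the top 20% is taken over active users (rounded up to at least one person). The function name is illustrative.

```python
def adoption_metrics(seats_licensed, sessions_by_user):
    """WAU %, average sessions per active user, and power-user concentration.

    sessions_by_user: dict mapping user id -> sessions this week
    (users with zero sessions, or missing entirely, count as inactive).
    """
    if seats_licensed <= 0:
        raise ValueError("seats_licensed must be positive")

    active = sorted((n for n in sessions_by_user.values() if n > 0), reverse=True)
    wau_pct = 100.0 * len(active) / seats_licensed

    if not active:
        return {"wau_pct": 0.0, "avg_sessions": 0.0, "top20_share_pct": 0.0}

    avg_sessions = sum(active) / len(active)

    # Top 20% of active users, rounded up with integer arithmetic.
    top_n = max(1, (len(active) + 4) // 5)
    top20_share_pct = 100.0 * sum(active[:top_n]) / sum(active)

    return {
        "wau_pct": wau_pct,
        "avg_sessions": avg_sessions,
        "top20_share_pct": top20_share_pct,
    }


# Made-up week: 10 seats, 4 people actually used the tool.
metrics = adoption_metrics(10, {"a": 20, "b": 10, "c": 5, "d": 5, "e": 0})
```

In this example the team sits at 40% WAU with half of all sessions coming from one user, which on the thresholds above reads as borderline shelfware.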

Review-to-merge time

Review-to-merge time is the single cleanest indicator of whether AI is actually accelerating shipping or just generating PRs that pile up.

Definition. Median time from PR opened to PR merged, in hours, last 7 and 30 days. Split by authorship.
Source. GitHub API (pull_requests.merged_at - created_at) or whatever ships with your DevEx platform (Jellyfish, LinearB, DX, Faros, Swarmia).
2026 benchmarks. Elite teams under 8 hours, healthy teams 8–24 hours, struggling teams 30–48+ hours. The published industry data shows median cycle time drops 24% as AI adoption rises — but only if review capacity scales to match. If your throughput climbs and your review-to-merge time climbs with it, you have a review bottleneck (Q9, Q10) hiding behind your throughput win.
The “ghost-PR” trap. Some teams report review-to-merge time across all PRs including auto-merge bot PRs (Renovate, Dependabot, etc.). Exclude those — they bias the median downward and hide real problems.
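A minimal sketch of the median calculation with bot PRs excluded follows. It assumes each PR has already been fetched into a dict with `created_at` and `merged_at` as `datetime` objects and an `author` login; the bot logins listed are the usual GitHub App names for Dependabot and Renovate, but treat the set as an assumption and adjust it to whatever automation runs in your repositories.

```python
import statistics
from datetime import datetime

# Assumed bot logins; extend for your own automation accounts.
BOT_AUTHORS = frozenset({"dependabot[bot]", "renovate[bot]"})


def median_review_to_merge_hours(prs, bot_authors=BOT_AUTHORS):
    """Median hours from PR opened to PR merged, ignoring bots and unmerged PRs."""
    hours = [
        (pr["merged_at"] - pr["created_at"]).total_seconds() / 3600.0
        for pr in prs
        if pr["merged_at"] is not None and pr["author"] not in bot_authors
    ]
    return statistics.median(hours) if hours else None


opened = datetime(2026, 3, 2, 9, 0)
sample_prs = [
    {"author": "ana", "created_at": opened, "merged_at": datetime(2026, 3, 2, 11, 0)},
    {"author": "ben", "created_at": opened, "merged_at": datetime(2026, 3, 2, 15, 0)},
    {"author": "cy", "created_at": opened, "merged_at": datetime(2026, 3, 2, 19, 0)},
    # Auto-merged dependency bump: would drag the median down if included.
    {"author": "dependabot[bot]", "created_at": opened,
     "merged_at": datetime(2026, 3, 2, 9, 6)},
    {"author": "dee", "created_at": opened, "merged_at": None},  # still open
]
with_filter = median_review_to_merge_hours(sample_prs)
without_filter = median_review_to_merge_hours(sample_prs, bot_authors=frozenset())
```

Running the same function once per authorship bucket gives the AI-authored vs human-only split.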

Cost-per-feature (link Linear/Jira tickets to PR token spend)

Cost-per-feature is the most ambitious metric on the panel and the one that turns spend into a unit-economic answer. It is also the metric that demands the most plumbing, because it requires you to bridge three systems: your AI cost data (per-PR), your VCS (PRs → tickets), and your tracker (tickets → features).

Linkage rule. A ticket’s AI cost = sum of token cost of every PR linked to that ticket. PR-to-ticket links come from branch names (mjas/PROJ-1234/checkout-v2), commit messages, or PR descriptions. Most teams enforce one of these conventions in CI.
Granularity. The useful unit is the feature ticket (an epic or a top-level user-story ticket), not the implementation sub-task. Aggregate up the parent-child relationship in Linear/Jira.
What it reveals. The published 2026 examples are striking: a checkout-v2 epic at one Series-B fintech came to $4,200 in AI spend across 47 PRs over six weeks, about $90/PR on average, but with a $300 outlier PR that turned out to be an agent loop on a flaky test. Without the cost-per-feature view that $300 PR is invisible. With it, the outlier is the headline.
Tooling. CloudZero, Vantage, Holori, and Finout all market AI cost observability in 2026 with feature-level allocation. The Nvidia 2026 blog post Rethinking AI TCO frames cost-per-token as the underlying unit but argues — correctly — that the metric that matters at the executive layer is cost per output the business cares about, which for engineering is cost-per-feature. Gartner’s 2026 framing of the same problem notes that agentic models require 5–30× more tokens than standard chatbots, which makes per-feature cost-tracking the only way to keep agentic adoption sustainable.

Pair cost-per-feature with the value tag on the ticket (T-shirt size, business-value rating, or a manual judgment from the PM). The pair, not the cost alone, is what you steer by.
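The linkage rule above (PR to ticket via naming convention, then ticket to parent feature) can be sketched as a small roll-up. Everything here is illustrative: the `TICKET_RE` pattern assumes Jira-style keys such as PROJ-1234, the PR dict fields are hypothetical, and `parent_of` stands in for whatever parent-child data you export from Linear or Jira.

```python
import re
from collections import defaultdict

# Assumed key format: uppercase project prefix, hyphen, number (e.g. PROJ-1234).
TICKET_RE = re.compile(r"\b([A-Z][A-Z0-9]+-\d+)\b")


def ticket_key(pr):
    """Find a ticket key in the branch name, then the title, then the body."""
    for text in (pr.get("branch", ""), pr.get("title", ""), pr.get("body", "")):
        match = TICKET_RE.search(text)
        if match:
            return match.group(1)
    return None


def cost_per_feature(prs, parent_of):
    """Sum PR token cost per top-level ticket.

    parent_of: dict mapping child ticket key -> parent ticket key.
    Returns (totals_by_feature, unlinked_cost).
    """
    totals = defaultdict(float)
    unlinked = 0.0
    for pr in prs:
        key = ticket_key(pr)
        if key is None:
            unlinked += pr["token_cost_usd"]  # surface this; do not hide it
            continue
        seen = set()
        while key in parent_of and key not in seen:  # guard against cycles
            seen.add(key)
            key = parent_of[key]
        totals[key] += pr["token_cost_usd"]
    return dict(totals), unlinked


sample_prs = [
    {"branch": "mjas/PROJ-1234/checkout-v2", "token_cost_usd": 90.0},
    {"branch": "fix-flaky-test", "title": "PROJ-1235: retry logic",
     "token_cost_usd": 300.0},
    {"branch": "misc-cleanup", "token_cost_usd": 5.0},
]
parents = {"PROJ-1234": "PROJ-1000", "PROJ-1235": "PROJ-1000"}
totals, unlinked = cost_per_feature(sample_prs, parents)
```

Reporting the unlinked total alongside the per-feature totals is a cheap check on how well the naming convention is actually being followed.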

Step-by-step: rolling out the metrics panel

Decide who owns the panel. One named human — typically the VP Engineering or head of DevEx — owns the dashboard, the data quality, and the weekly review. Without a named owner, the panel rots in a quarter. Their job is not to build it; their job is to keep it honest and visible.
Inventory your data sources. For each of the six metrics, write down the source (Anthropic Admin API, Cursor admin export, GitHub API, Linear API, etc.) and the access credentials. Most teams discover at this step that they don’t have admin access to one of the vendors — fix that before you go further. Q3 (Team billing) and Q4 (Cost visibility) are the precursors; if you scored low there, address those first.
Pull a 90-day backfill for all six. A week of data is not enough — every metric needs at least 90 days to show a trend, and ideally 180. Pull spend, throughput, quality, adoption, review-to-merge, and cost-per-feature into a single table (one row per developer per week is the natural grain). A scratch BigQuery / Snowflake / DuckDB instance is fine; a Google Sheet is fine for week one if it gets you moving.
Build the authorship split. This is the single highest-leverage data-modeling step. For every merged PR in the backfill, tag it as ai-authored or human-only using your label (Q11), commit trailer convention, or — failing both — a heuristic on PR body / branch name. Without the split, throughput and quality are vanity metrics. With it, they are decisions.
Define the active-developer denominator. “Active developer” = anyone who merged at least one PR in the trailing 14 days. Use this as the denominator for spend per dev and throughput per dev; active-user % keeps seats licensed as its denominator, as defined on the adoption tile. Lock it in writing so the metric doesn’t drift quarter to quarter.
Stand up the six tiles on one page. Metabase, Hex, Looker, Grafana, Mode, even Notion embeds — the tool doesn’t matter. What matters is that the six tiles are on one page, refreshed at least weekly, with 90-day trend lines. Add a row of headline numbers at the top: today’s value and 90-day delta for each. Make the page bookmarkable.
Layer in cost-per-feature. The hardest of the six. Build the join from PR token spend (Q4 pipeline) → PR → ticket (via branch naming convention) → feature/epic (via Linear/Jira parent links). Start with the top 20 epics by PR count; expand once the join works.
Annotate baselines and owners on each tile. Each tile shows: today’s number, 90-day baseline, direction of travel, and the named owner who is on the hook. The annotation is what turns a chart into accountability.
Calendar a weekly 20-minute review. Engineering leadership + the panel owner. Open the dashboard. Look at each tile. Decide one action item from the panel for the coming week. Action items are short; the discipline of doing them every week compounds.
Re-share the panel quarterly with finance and product. The whole point of measuring is that you can have the conversation with finance about ROI (Q23) and with product about cost-per-feature without flinching. Once a quarter, walk the panel through a CFO/CPO review. The exercise will teach you which metrics you actually trust and which ones you’ve been hiding behind.
Add complexity-adjusted throughput in quarter two. Once the basic panel is stable, layer in CAT. Either approximate from diff size + file count, or do manual tagging on a sample (50–100 PRs per quarter is plenty). The shift from raw PR count to CAT usually reveals that the throughput “gain” from AI is smaller than the dashboard implied — and that is exactly the kind of finding the dashboard exists to surface.
Sunset metrics you stop acting on. If a tile hasn’t driven a decision in a quarter, kill it or replace it. The panel is six numbers for a reason; the discipline of keeping it to six numbers is most of the value.
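Of the rollout steps above, the active-developer denominator is the one most worth pinning down in code, because every per-developer tile depends on it. A minimal sketch, assuming merge events are available as (author, merge date) pairs; the function names are illustrative.

```python
from datetime import date, timedelta


def active_developers(merges, as_of, window_days=14):
    """Authors with at least one merged PR in the trailing window.

    merges: iterable of (author, merged_on) pairs.
    The window covers the `window_days` days ending on `as_of`, inclusive.
    """
    start = as_of - timedelta(days=window_days)
    return {author for author, merged_on in merges if start < merged_on <= as_of}


def per_active_dev(total, merges, as_of, window_days=14):
    """Divide a team-level total (spend, PRs, points) by active developers."""
    active = active_developers(merges, as_of, window_days)
    if not active:
        raise ValueError("no active developers in the window")
    return total / len(active)


merges = [
    ("ana", date(2026, 3, 10)),
    ("ana", date(2026, 3, 12)),
    ("ben", date(2026, 2, 1)),   # outside the 14-day window
    ("cy", date(2026, 3, 1)),
]
as_of = date(2026, 3, 14)
active = active_developers(merges, as_of)
spend_per_dev = per_active_dev(600.0, merges, as_of)
```

Keeping `window_days` as an explicit parameter makes the written definition and the computed one easy to compare when someone proposes loosening it.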

Common pitfalls

Vanity metrics with no baseline. “Throughput is up 30%” against what? Without a 90-day baseline on every tile, the dashboard is decoration. Bake the baseline into the tile, not into a footnote.
Tracking only one or two metrics. Most 2026 CTOs track spend and maybe adoption, and stop. The leverage is in the cross-tile reads — spend up + throughput flat = routing problem; throughput up + quality down = review-gate problem; adoption up + cost-per-feature up = model-choice problem. You can’t do those reads with two tiles.
No authorship split. Without splitting AI-authored vs human-only PRs, throughput and quality are uninterpretable. The 9% bug-rate uplift and CodeRabbit’s 2.74× cross-site-scripting multiplier on AI-authored code are averages — your team is somewhere on a distribution and you can’t tell where without the split.
No active-developer denominator. Using “total seats” instead of “active developers” hides the adoption problem. Using “merged at least one PR in 30 days” hides everyone who tried the tools for a week and stopped. Pick a tight definition (14 days, one PR) and keep it.
No review cadence. A dashboard that exists but isn’t reviewed is the most expensive single artifact in the eng org — it took weeks to build and it changes no decisions. The 20-minute weekly review is non-negotiable. If leadership can’t make the meeting, the dashboard isn’t useful.
Pure cost-cutting framing. Q22 isn’t about driving spend down. It’s about making spend legible relative to outcomes. A team whose spend doubles and whose cost-per-feature halves is winning. A team whose spend stays flat and whose throughput stays flat is losing; they’re just losing slowly.
Confusing acceptance rate with productivity. Copilot reports an acceptance rate that some teams put on dashboards as “AI productivity.” Acceptance rate measures how often suggestions are taken; it does not measure whether they shipped value. Don’t substitute it for the panel; pair it with throughput and quality.
Building the panel before the underlying instrumentation. Q22 sits on top of Q3, Q4, Q9, Q11. If you haven’t done team billing, cost visibility, layered PR review, or AI-PR labelling, the panel will be hollow. Sequence: fix the precursors, then build the panel.
Confidence intervals you don’t acknowledge. With small teams (< 20 developers) weekly numbers are noisy. Trend on 4-week rolling averages and put confidence bands on the tiles; otherwise the eng-leadership room will overreact to a single noisy week.
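The cross-tile reads named in the pitfalls above can be written down as explicit rules, which makes them easy to run in the weekly review. This is a sketch under stated assumptions: the 5% tolerance for calling a metric "flat" is arbitrary, the function names are illustrative, and "quality" here means the direction of quality itself, so a rising bug rate should be passed in as quality "down".

```python
def direction(today, baseline, tolerance_pct=5.0):
    """Classify a metric as 'up', 'down', or 'flat' versus its baseline."""
    if baseline == 0:
        raise ValueError("baseline is zero, direction undefined")
    change_pct = (today - baseline) / baseline * 100.0
    if change_pct > tolerance_pct:
        return "up"
    if change_pct < -tolerance_pct:
        return "down"
    return "flat"


def cross_tile_reads(d):
    """Apply the three cross-tile reads to a dict of metric -> direction."""
    reads = []
    if d["spend"] == "up" and d["throughput"] == "flat":
        reads.append("routing problem")
    if d["throughput"] == "up" and d["quality"] == "down":
        reads.append("review-gate problem")
    if d["adoption"] == "up" and d["cost_per_feature"] == "up":
        reads.append("model-choice problem")
    return reads


# Made-up quarter: spend and adoption rose, throughput barely moved.
directions = {
    "spend": direction(110, 100),
    "throughput": direction(102, 100),
    "quality": "flat",
    "adoption": direction(72, 60),
    "cost_per_feature": direction(130, 100),
}
findings = cross_tile_reads(directions)
```

For small teams, feeding 4-week rolling averages into `direction` rather than single-week values keeps one noisy week from triggering a read.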

How to verify you’re there

All six metrics — spend, throughput, quality, adoption, review-to-merge, cost-per-feature — are visible on one page, refreshed at least weekly, with 90-day baselines on every tile.
Every tile has a named human on the hook to move the number this quarter.
Throughput and quality are split AI-authored vs human-only, and you can quote both numbers from memory.
A weekly 20-minute review is on the calendar, attended by engineering leadership, with at least one action item from the most recent two weeks shipped.
The most expensive feature ticket of the last quarter is identifiable by name, with a real dollar number attached, not a range.
When the CFO asks “what is AI tooling costing us per shipped feature?”, you can answer in under two minutes using the panel.
When the head of product asks “is AI making the team faster?”, you can answer with the throughput and review-to-merge tiles, split by authorship, instead of a vibes-based reply.
At least one of your six metrics has materially moved in the right direction in the last 90 days as a result of a deliberate change documented in the weekly review.
A new engineering hire can read the panel in five minutes and tell you which way the team is trending on AI tooling.
The panel survived a CFO review without you walking back any numbers — meaning the data quality is high enough that you trust your own dashboard.