The four-stage pipeline only works if each handoff is genuinely good. A weak artefact at Stage 1 doesn’t get rescued by a strong agent at Stage 4 — it gets faithfully translated into weak code. This page is the quality gate that runs at every handoff.
Errors compound through the pipeline: a 6/10 mockup yields a 6/10 prototype, which yields a 6/10 design system, which yields 6/10 code. The cost of fixing a problem also grows at every stage.
Rating and iterating at the gate prevents that downstream rework. The point isn’t perfectionism; the model that just produced the artefact is the cheapest reviewer of it.
Apply these at every stage. The dimensions are constant; only the weights change.
Fidelity
Does the artefact match the brief and references? At Stage 1, does the mockup reflect the references and brand guide? At Stage 4, does the code match the handoff bundle’s tokens and components?
Coherence
Does the artefact hold together internally? Typography scales used consistently. Spacing rhythm. One voice in copy. State treatments that follow a rule, not vibes.
Completeness
Are edge cases, states, and copy filled in? No ???, no lorem ipsum, no “Feature 1 / Feature 2”, and every component has every state from the matrix.
Craft
Is it polished enough that a stranger would assume a designer or senior engineer made it? The taste-level read.
Score each 0–10. The artefact’s score is the minimum of the four, not the mean — a 10/10 in three dimensions and a 4/10 in the fourth is a 4/10 artefact. The weakest dimension is what your downstream stages will inherit.
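The minimum rule above can be sketched in a few lines of Python. This is an illustrative helper, not part of any tool in the pipeline; the `Rating` class and its method names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """Scores for the four dimensions, each on the 0-10 scale (hypothetical helper)."""

    fidelity: int
    coherence: int
    completeness: int
    craft: int

    def __post_init__(self) -> None:
        # Reject anything outside the 0-10 range used by the rubric.
        for name, value in vars(self).items():
            if not 0 <= value <= 10:
                raise ValueError(f"{name} must be between 0 and 10, got {value}")

    def overall(self) -> int:
        # The artefact's score is the minimum of the four, not the mean.
        return min(self.fidelity, self.coherence, self.completeness, self.craft)

    def weakest(self) -> list[str]:
        # Names of the dimension(s) tied for the lowest score.
        low = self.overall()
        return [name for name, value in vars(self).items() if value == low]
```

For example, `Rating(10, 10, 10, 4).overall()` is 4 even though the mean is 8.5, and `weakest()` names `craft` as the dimension the next stage will inherit.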
Run the critique in the same session that produced the artefact. The model that built it has the most context for reviewing it, and asking it to play hostile reviewer surfaces failure modes that praise mode glosses over.
Replace {ARTEFACT} in the prompt with the artefact’s literal name, for example “the mockup”, “index.html”, “the design system handoff bundle”, or “the tokens.css and component skeletons you just produced”.
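If the critique is sent from a script rather than typed by hand, the substitution can be sketched as below. Only the {ARTEFACT} placeholder, the hostile-reviewer framing, and the four dimension names come from this page; the template wording and the `build_critique_prompt` helper are placeholders invented for illustration, not the page's actual prompt.

```python
# Hypothetical template: the wording is an assumption. Only the {ARTEFACT}
# placeholder and the four dimension names are taken from this page.
CRITIQUE_TEMPLATE = (
    "Act as a hostile reviewer of {ARTEFACT}. "
    "Score Fidelity, Coherence, Completeness and Craft from 0 to 10, "
    "and report the minimum of the four as the overall score."
)


def build_critique_prompt(artefact_name: str) -> str:
    """Fill the {ARTEFACT} placeholder with a literal artefact name."""
    name = artefact_name.strip()
    if not name:
        raise ValueError("artefact_name must be a literal name such as 'the mockup'")
    return CRITIQUE_TEMPLATE.replace("{ARTEFACT}", name)
```

Calling `build_critique_prompt("index.html")` returns the template with the file name in place of the placeholder; an empty name is rejected so a vague prompt is never sent.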
The loop is the same everywhere; what differs is how each tool keeps the just-built artefact in scope so the critique reviews the real thing, not a summary of it.
Keep the artefact file open (e.g. index.html, tokens.css) and run the critique in agent-mode chat so it re-reads the open editor. Add the file with an @-mention if the agent drifts to other context. Use a checkpoint before the drive-to-10 follow-up so you can roll back a bad iteration.
Run the critique in the same terminal session that generated the artefact: the file is already in context, and the agent will re-read it from disk before reviewing. Reference the path explicitly (“Re-read tokens.css, then critique it”) so the agent grades the saved file, not its own recollection.
Run on the same task/worktree that produced the artefact (App, IDE, or CLI) so the diff stays in scope. In the IDE surface, keep the file in the working set; in the App, continue the same task rather than starting a new one, so the artefact it grades is the one it just wrote.
The “stop and tell me what you need” clause is important. The model will otherwise keep iterating in circles when the real blocker is a decision you have to make (a brand call, a product trade-off, a missing requirement).
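The stopping behaviour can be sketched as a loop with three exits: the artefact reaches 10, the model reports a blocker that needs a human decision, or a round limit is hit. The `Review` type, the `review_once` callable, and the default cap of three rounds are all assumptions made for this sketch; no specific limit is prescribed here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    score: int                     # overall score, 0-10 (minimum of the four dimensions)
    blocker: Optional[str] = None  # set when the model needs a decision from you


def drive_to_ten(
    review_once: Callable[[], Review], max_rounds: int = 3
) -> tuple[str, Review]:
    """Run critique-and-fix rounds until the score is 10, a blocker appears,
    or the (assumed) round limit is reached."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    for _ in range(max_rounds):
        review = review_once()
        if review.blocker is not None:
            # A human decision is needed; more iterations would only go in circles.
            return "blocked", review
        if review.score >= 10:
            return "done", review
    return "out_of_rounds", review
```

Here `review_once` stands in for one critique-and-fix turn in the session. A `"blocked"` result is the signal to make the brand call, product trade-off, or requirement decision yourself before continuing.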
The dimensions are the same; the weights aren’t.
| Stage | Heaviest weights | Why |
|---|---|---|
| 1 — Mockup | Fidelity, Craft | The mockup sets the visual ceiling. Get the references reflected and the polish high; completeness will be filled in by Stage 2. |
| 2 — HTML prototype | Completeness, Coherence | This is where states get implemented. Every button needs every state. Coherence catches drift from the mockup. |
| 3 — Design system | Coherence, Completeness | A system is defined by its consistency across screens and states. Both dimensions matter equally. |
| 4 — Code + docs | Fidelity, Craft | Fidelity to the handoff bundle (no invented tokens). Craft in the code (matches existing conventions, no dead code, types are real). |
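
The table can also be encoded as a lookup, for instance to decide which dimensions to scrutinise first at a given stage. This is a minimal sketch: the names `HEAVIEST_WEIGHTS` and `review_order` are hypothetical, and "check the heaviest dimensions first" is one possible reading of the weights, not a rule stated above.

```python
DIMENSIONS = ("fidelity", "coherence", "completeness", "craft")

# Heaviest-weighted dimensions per stage, copied from the table above.
HEAVIEST_WEIGHTS = {
    1: ("fidelity", "craft"),          # Mockup
    2: ("completeness", "coherence"),  # HTML prototype
    3: ("coherence", "completeness"),  # Design system
    4: ("fidelity", "craft"),          # Code + docs
}


def review_order(stage: int) -> list[str]:
    """Return all four dimensions, with the stage's heaviest-weighted ones first."""
    if stage not in HEAVIEST_WEIGHTS:
        raise ValueError(f"stage must be 1-4, got {stage}")
    heavy = list(HEAVIEST_WEIGHTS[stage])
    rest = [d for d in DIMENSIONS if d not in heavy]
    return heavy + rest
```

All four dimensions stay in the returned list at every stage, since the dimensions are constant and only the emphasis moves.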
Don’t loop forever. Three reasons to ship at less than 10/10:
Each stage of the pipeline links here at its handoff. When in doubt, rate. The cost of running the loop once is a minute. The cost of compounding errors through four stages is a redo.