The four-stage pipeline only works if each handoff is genuinely good. A weak artefact at Stage 1 doesn’t get rescued by a strong agent at Stage 4 — it gets faithfully translated into weak code. This page is the quality gate that runs at every handoff.
Errors compound through the pipeline. A 6/10 mockup yields a 6/10 prototype, which yields a 6/10 design system, which yields 6/10 code. And the cost of fixing a problem grows at every stage.
Rating + iterating at the gate prevents the downstream rework. The point isn’t perfectionism — it’s that the model that just produced the artefact is the cheapest reviewer of it.
Apply these at every stage. The dimensions are constant; only the weights change.
Fidelity
Does the artefact match the brief and references? At Stage 1, does the mockup reflect the references and brand guide? At Stage 4, does the code match the handoff bundle’s tokens and components?
Coherence
Does the artefact hold together internally? Typography scales used consistently. Spacing rhythm. One voice in copy. State treatments that follow a rule, not vibes.
Completeness
Are edge cases, states, and copy filled in? No “???”, no lorem ipsum, no “Feature 1 / Feature 2” — and every component has every state from the matrix.
Craft
Is it polished enough that a stranger would assume a designer or senior engineer made it? The taste-level read.
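One slice of the Completeness check — the placeholder scan — can be cheaply automated before you even ask the model to review. A minimal sketch; the pattern list and the `find_placeholders` helper are illustrative, not part of the pipeline itself:

```python
import re

# Placeholder patterns that signal an unfinished artefact.
# This list is a starting point; extend it with your own tells.
PLACEHOLDER_PATTERNS = [
    r"\?\?\?",        # unanswered questions
    r"lorem ipsum",   # filler copy
    r"Feature \d+",   # generic "Feature 1 / Feature 2" labels
    r"TODO|FIXME",    # leftover markers
]

def find_placeholders(text: str) -> list[str]:
    """Return every placeholder match found in an artefact's text."""
    hits = []
    for pattern in PLACEHOLDER_PATTERNS:
        hits.extend(re.findall(pattern, text, flags=re.IGNORECASE))
    return hits
```

Running it over a mockup's HTML, `find_placeholders("<h2>Feature 1</h2><p>lorem ipsum</p>")` returns `['lorem ipsum', 'Feature 1']` — two Completeness failures to fix before the gate.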
Score each 0–10. The artefact’s score is the minimum of the four, not the mean — a 10/10 in three dimensions and a 4/10 in the fourth is a 4/10 artefact. The weakest dimension is what your downstream stages will inherit.
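The min-not-mean rule is trivial to encode; a sketch, with names of my choosing rather than anything prescribed by the pipeline:

```python
def artefact_score(scores: dict[str, int]) -> int:
    """Overall score is the weakest dimension, not the average."""
    return min(scores.values())

ratings = {"fidelity": 10, "coherence": 10, "completeness": 10, "craft": 4}
artefact_score(ratings)  # 4: the downstream stages inherit the 4
```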
Run this in the same ChatGPT or Claude thread that produced the artefact. The model that built it has the most context for reviewing it — and asking it to play hostile reviewer surfaces failure modes that praise-mode glosses over.
You produced {ARTEFACT}. Now act as a hostile senior reviewer. Rate it 0–10 on Fidelity, Coherence, Completeness, and Craft. For every dimension scored below 10, list every specific issue that is keeping it from 10. Cite concrete locations (component name, region of the mockup, line of code). Do not be polite. Do not summarize. Do not say “overall it’s good”.

Replace {ARTEFACT} with the literal name: “the mockup”, “index.html”, “the design system handoff bundle”, “the tokens.css and component skeletons you just produced”.
Now fix every issue you listed. Produce the revised {ARTEFACT}. Then re-rate it on the same four dimensions and list any remaining gaps with concrete locations. Repeat until all four dimensions are 10/10, or you hit a constraint you can’t resolve without me — in which case stop and tell me exactly what you need.

The “stop and tell me what you need” clause is important. The model will otherwise keep iterating in circles when the real blocker is a decision you have to make (a brand call, a product trade-off, a missing requirement).
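The loop's termination logic is easy to get wrong when you script it. A sketch of the control flow, under the assumption that one rate-and-fix round with the model is wrapped in a `review` callable (a stand-in I'm inventing here, not a real API) that returns the four scores plus an optional blocker:

```python
def run_gate(review, max_rounds: int = 5):
    """Drive the rate-fix loop until every dimension hits 10,
    the model reports a blocker, or we give up iterating.

    `review` stands in for one rate-and-fix round with the model.
    It returns (scores, blocker), where blocker is None or the
    question the reviewer needs answered before it can continue.
    """
    for _ in range(max_rounds):
        scores, blocker = review()
        if blocker is not None:
            return f"blocked: {blocker}"  # a decision only you can make
        if min(scores.values()) == 10:
            return "ship"
    return "stalled"  # looping without converging; intervene manually
```

The three exits mirror the prompt: "ship" when all four dimensions reach 10, "blocked" when the model hits a constraint it can't resolve without you, and "stalled" as the backstop against iterating in circles.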
The dimensions are the same; the weights aren’t.
| Stage | Heaviest weights | Why |
|---|---|---|
| 1 — Mockup | Fidelity, Craft | The mockup sets the visual ceiling. Get the references reflected and the polish high; completeness will be filled in by Stage 2. |
| 2 — HTML prototype | Completeness, Coherence | This is where states get implemented. Every button needs every state. Coherence catches drift from the mockup. |
| 3 — Design system | Coherence, Completeness | A system is defined by its consistency across screens and states. Both dimensions matter equally. |
| 4 — Code + docs | Fidelity, Craft | Fidelity to the handoff bundle (no invented tokens). Craft in the code (matches existing conventions, no dead code, types are real). |
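If you feed this table to tooling, one way to encode it is as a stage-to-focus map. The stage pairs below come straight from the table; the `review_order` helper, which just front-loads the heaviest dimensions when reviewing, is my own addition:

```python
# Heaviest-weighted dimensions per stage, per the table above.
STAGE_FOCUS = {
    1: ("fidelity", "craft"),          # mockup sets the visual ceiling
    2: ("completeness", "coherence"),  # prototype implements every state
    3: ("coherence", "completeness"),  # a system lives on consistency
    4: ("fidelity", "craft"),          # code must match the bundle
}

DIMENSIONS = ("fidelity", "coherence", "completeness", "craft")

def review_order(stage: int) -> list[str]:
    """List all four dimensions, heaviest-weighted first for this stage."""
    heavy = STAGE_FOCUS[stage]
    return list(heavy) + [d for d in DIMENSIONS if d not in heavy]
```

So `review_order(2)` puts Completeness and Coherence first — the score is still the minimum of all four, but you scrutinize the stage's likely failure modes before the rest.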
Don’t loop forever. Three reasons to ship at less than 10/10:
Each stage of the pipeline links here at its handoff. When in doubt, rate. The cost of running the loop once is a minute. The cost of compounding errors through four stages is a redo.