The four-stage pipeline only works if each handoff is genuinely good. A weak artefact at Stage 1 doesn’t get rescued by a strong agent at Stage 4 — it gets faithfully translated into weak code. This page is the quality gate that runs at every handoff.
Errors compound through the pipeline. A 6/10 mockup yields a 6/10 prototype, which yields a 6/10 design system, which yields 6/10 code. And the cost of fixing a problem grows at every stage.
Rating + iterating at the gate prevents the downstream rework. The point isn’t perfectionism — it’s that the model that just produced the artefact is the cheapest reviewer of it.
Apply these at every stage. The dimensions are constant; only the weights change.
Fidelity
Does the artefact match the brief and references? At Stage 1, does the mockup reflect the references and brand guide? At Stage 4, does the code match the handoff bundle’s tokens and components?
Coherence
Does the artefact hold together internally? Typography scales used consistently. Spacing rhythm. One voice in copy. State treatments that follow a rule, not vibes.
Completeness
Are edge cases, states, and copy filled in? No “???”, no lorem ipsum, no “Feature 1 / Feature 2” — and every component has every state from the matrix.
Craft
Is it polished enough that a stranger would assume a designer or senior engineer made it? The taste-level read.
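One slice of the Completeness check — the placeholder scan — can be cheaply automated before you even ask the model to review. A minimal sketch; the pattern list and the `find_placeholders` helper are illustrative, not part of the pipeline itself:

```python
import re

# Placeholder patterns that signal an unfinished artefact.
# This list is a starting point; extend it with your own tells.
PLACEHOLDER_PATTERNS = [
    r"\?\?\?",        # unanswered questions
    r"lorem ipsum",   # filler copy
    r"Feature \d+",   # generic "Feature 1 / Feature 2" labels
    r"TODO|FIXME",    # leftover markers
]

def find_placeholders(text: str) -> list[str]:
    """Return every placeholder match found in an artefact's text."""
    hits = []
    for pattern in PLACEHOLDER_PATTERNS:
        hits.extend(re.findall(pattern, text, flags=re.IGNORECASE))
    return hits
```

Running it over a mockup's HTML, `find_placeholders("<h2>Feature 1</h2><p>lorem ipsum</p>")` returns `['lorem ipsum', 'Feature 1']` — two Completeness failures to fix before the gate.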
Score each 0–10. The artefact’s score is the minimum of the four, not the mean — a 10/10 in three dimensions and a 4/10 in the fourth is a 4/10 artefact. The weakest dimension is what your downstream stages will inherit.
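The min-not-mean rule is trivial to encode; a sketch, with names of my choosing rather than anything prescribed by the pipeline:

```python
def artefact_score(scores: dict[str, int]) -> int:
    """Overall score is the weakest dimension, not the average."""
    return min(scores.values())

ratings = {"fidelity": 10, "coherence": 10, "completeness": 10, "craft": 4}
artefact_score(ratings)  # 4: the downstream stages inherit the 4
```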
Run this in the same ChatGPT or Claude thread that produced the artefact. The model that built it has the most context for reviewing it — and asking it to play hostile reviewer surfaces failure modes that praise-mode glosses over.
You produced {ARTEFACT}. Now act as a hostile senior reviewer. Rate it 0–10 on Fidelity, Coherence, Completeness, and Craft. For every dimension scored below 10, list every specific issue that is keeping it from 10. Cite concrete locations (component name, region of the mockup, line of code). Do not be polite. Do not summarize. Do not say “overall it’s good”.

Replace {ARTEFACT} with the literal name: “the mockup”, “index.html”, “the design system handoff bundle”, “the tokens.css and component skeletons you just produced”.
Now fix every issue you listed. Produce the revised {ARTEFACT}. Then re-rate it on the same four dimensions and list any remaining gaps with concrete locations. Repeat until all four dimensions are 10/10, or you hit a constraint you can’t resolve without me — in which case stop and tell me exactly what you need.

The “stop and tell me what you need” clause is important. The model will otherwise keep iterating in circles when the real blocker is a decision you have to make (a brand call, a product trade-off, a missing requirement).
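The loop's termination logic is easy to get wrong when you script it. A sketch of the control flow, under the assumption that one rate-and-fix round with the model is wrapped in a `review` callable (a stand-in I'm inventing here, not a real API) that returns the four scores plus an optional blocker:

```python
def run_gate(review, max_rounds: int = 5):
    """Drive the rate-fix loop until every dimension hits 10,
    the model reports a blocker, or we give up iterating.

    `review` stands in for one rate-and-fix round with the model.
    It returns (scores, blocker), where blocker is None or the
    question the reviewer needs answered before it can continue.
    """
    for _ in range(max_rounds):
        scores, blocker = review()
        if blocker is not None:
            return f"blocked: {blocker}"  # a decision only you can make
        if min(scores.values()) == 10:
            return "ship"
    return "stalled"  # looping without converging; intervene manually
```

The three exits mirror the prompt: "ship" when all four dimensions reach 10, "blocked" when the model hits a constraint it can't resolve without you, and "stalled" as the backstop against iterating in circles.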
The dimensions are the same; the weights aren’t.
| Stage | Heaviest weights | Why |
|---|---|---|
| 1 — Mockup | Fidelity, Craft | The mockup sets the visual ceiling. Get the references reflected and the polish high; completeness will be filled in by Stage 2. |
| 2 — HTML prototype | Completeness, Coherence | This is where states get implemented. Every button needs every state. Coherence catches drift from the mockup. |
| 3 — Design system | Coherence, Completeness | A system is defined by its consistency across screens and states. Both dimensions matter equally. |
| 4 — Code + docs | Fidelity, Craft | Fidelity to the handoff bundle (no invented tokens). Craft in the code (matches existing conventions, no dead code, types are real). |
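If you feed this table to tooling, one way to encode it is as a stage-to-focus map. The stage pairs below come straight from the table; the `review_order` helper, which just front-loads the heaviest dimensions when reviewing, is my own addition:

```python
# Heaviest-weighted dimensions per stage, per the table above.
STAGE_FOCUS = {
    1: ("fidelity", "craft"),          # mockup sets the visual ceiling
    2: ("completeness", "coherence"),  # prototype implements every state
    3: ("coherence", "completeness"),  # a system lives on consistency
    4: ("fidelity", "craft"),          # code must match the bundle
}

DIMENSIONS = ("fidelity", "coherence", "completeness", "craft")

def review_order(stage: int) -> list[str]:
    """List all four dimensions, heaviest-weighted first for this stage."""
    heavy = STAGE_FOCUS[stage]
    return list(heavy) + [d for d in DIMENSIONS if d not in heavy]
```

So `review_order(2)` puts Completeness and Coherence first — the score is still the minimum of all four, but you scrutinize the stage's likely failure modes before the rest.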
Don’t loop forever. Three reasons to ship at less than 10/10:
Each stage of the pipeline links here at its handoff. When in doubt, rate. The cost of running the loop once is a minute. The cost of compounding errors through four stages is a redo.