MIT's NANDA report found that 95% of GenAI pilots deliver no measurable impact. METR's randomized controlled trial found that experienced developers believed AI tools were speeding them up even when they were measurably slower. Both point to the same missing discipline: evaluation. The teams that ship AI into production and the teams whose pilots quietly die differ less in their models than in whether they can answer one question with a number: is this actually working?
Why evals matter more for AI
Traditional code is deterministic — a test passes or fails, and the same input always produces the same output. AI output is probabilistic; the same prompt can give different answers, and "is it good enough?" becomes an argument unless you make it a number. Without evals, you are flying on vibes. You can't tell if a prompt change helped or just felt better in the three examples you happened to try. You can't tell if last week's model upgrade quietly regressed an important case. You can't tell if quality is drifting as real inputs diverge from what you tested. Every one of those is a silent way for a promising pilot to rot.
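To put a rough number on why three hand-picked examples prove so little, here is a minimal sketch using the Wilson score interval, one common way to estimate uncertainty in a pass rate. The function name and the sample counts are illustrative assumptions, and the interval treats examples as independent draws, which real test sets only approximate.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate of passes/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts only: the same kind of result at three sample sizes.
for passes, n in [(3, 3), (45, 50), (450, 500)]:
    low, high = wilson_interval(passes, n)
    print(f"{passes}/{n} passed: plausible true pass rate {low:.2f} to {high:.2f}")
```

Under these assumptions, three passes out of three is still consistent with a true pass rate below 50%, while the interval narrows considerably as the number of examples grows.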
Building a harness
1. Curate a golden set of representative inputs with known-good outputs (and known-hard edge cases). Start small, since even 50 well-chosen examples beat none, and grow it every time you find a new failure.
2. Pick metrics that match the task: exact match for structured extraction, rubric scoring for open-ended text, an LLM-as-judge with a clear rubric for nuanced quality, or human review for the genuinely ambiguous slice.
3. Set a target before you build. "90% on the golden set" turns opinion into a finish line and stops the goalposts from moving once you're emotionally invested in shipping.
4. Run evals in CI so every prompt or model change is scored automatically, before it reaches users.
Evals are to AI features what tests are to software. Shipping without them is shipping blind — and blind is how pilots become the 95%.
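The four steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a reference implementation: `Example`, `exact_match`, `run_evals`, `ci_exit_code` and the three-item golden set are hypothetical names and data, and `fake_generate` stands in for whatever model call your feature actually makes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    prompt: str     # representative input
    expected: str   # known-good output


def exact_match(output: str, expected: str) -> bool:
    """Metric suited to structured extraction: whitespace-insensitive equality."""
    return output.strip() == expected.strip()


def run_evals(
    generate: Callable[[str], str],
    golden_set: list[Example],
    metric: Callable[[str, str], bool] = exact_match,
) -> float:
    """Score every golden example and return the fraction that pass."""
    passed = sum(metric(generate(ex.prompt), ex.expected) for ex in golden_set)
    return passed / len(golden_set)


TARGET = 0.90  # agreed before building, as in step 3


def ci_exit_code(score: float, target: float = TARGET) -> int:
    """0 lets the CI job pass, 1 fails it."""
    return 0 if score >= target else 1


def fake_generate(prompt: str) -> str:
    # Stand-in for the real model call (assumption): returns the last word.
    return prompt.split()[-1]


GOLDEN_SET = [
    Example("Invoice total: 42", "42"),
    Example("Order id: A17", "A17"),
    Example("Ref code: none given", "N/A"),  # known-hard edge case
]

score = run_evals(fake_generate, GOLDEN_SET)
print(f"pass rate {score:.0%}, target {TARGET:.0%}, exit code {ci_exit_code(score)}")
```

In a CI job, the script would typically end with `sys.exit(ci_exit_code(score))` so that a score below the target blocks the change.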
Feed real failures back in
The golden set is not a one-time artefact. The most valuable examples come from production: the queries that embarrassed you, the edge cases you didn't anticipate, the complaints from users. Every time something goes wrong, capture the input and add it to the set. Over months this turns your eval suite into an institutional memory of every way your feature has failed, and a strong safeguard against shipping the same regression twice. It only works if you log inputs and outputs, so that failures can actually be reconstructed.
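One lightweight way to capture failures is an append-only JSONL file that the eval run reads. The sketch below assumes that layout; the `add_failure` helper, the record fields and the sample complaint are hypothetical, and a real system would likely pull the input from its request logs.

```python
import json
import tempfile
from pathlib import Path


def add_failure(path: Path, input_text: str, expected: str, note: str = "") -> bool:
    """Append a production failure to a JSONL golden set.

    Returns False (and writes nothing) if the exact input is already present.
    """
    existing = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["input"] for line in f if line.strip()}
    if input_text in existing:
        return False
    record = {
        "input": input_text,
        "expected": expected,
        "source": "production",
        "note": note,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True


# Demo in a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "golden_set.jsonl"
    first = add_failure(
        golden, "refund for order placed twice", "escalate_to_human",
        note="bot promised a refund it could not issue",
    )
    second = add_failure(golden, "refund for order placed twice", "escalate_to_human")
    line_count = len(golden.read_text(encoding="utf-8").splitlines())
    print(first, second, line_count)
```

Deduplicating on the exact input keeps the file from filling up with repeats of one incident; near-duplicates would need fuzzier matching, which this sketch leaves out.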
A note on LLM-as-judge
For open-ended output, the most practical metric is often another model scoring the answer against a rubric. It scales where human review can't, and it's far more consistent than eyeballing a handful of examples. But it has to be done carefully: a vague instruction like "rate this 1 to 10" produces noise. Give the judge a specific rubric — what counts as correct, what counts as a serious error, what to ignore — and spot-check its scores against human judgement periodically to make sure the two haven't diverged. Treat the judge as another component that itself needs evaluating, not as an oracle. Used well, it lets a small team measure quality across thousands of cases that would otherwise be unmeasurable.
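As a rough illustration of both halves of that advice, the sketch below builds a specific rubric prompt and then measures how often the judge agrees with human labels. Everything here is an assumption made for the example: the rubric wording, the PASS/FAIL output format, the labelled cases, and `stub_judge`, which stands in for a real model call.

```python
from typing import Callable

RUBRIC = (
    "You are grading a customer-support answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "PASS if the answer addresses the question and is consistent with policy.\n"
    "FAIL if it invents policy, promises something unsupported, or ignores the question.\n"
    "Ignore tone, length, and formatting.\n"
    "Reply with exactly one word: PASS or FAIL."
)


def build_judge_prompt(question: str, answer: str) -> str:
    return RUBRIC.format(question=question, answer=answer)


def parse_verdict(raw: str) -> bool:
    """Map the judge's reply to True (pass) or False (fail); reject anything else."""
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected judge output: {raw!r}")
    return verdict == "PASS"


def judge_agreement(
    judge: Callable[[str], str],
    labelled: list[tuple[str, str, bool]],
) -> float:
    """Fraction of human-labelled cases where the judge reaches the same verdict."""
    matches = 0
    for question, answer, human_pass in labelled:
        verdict = parse_verdict(judge(build_judge_prompt(question, answer)))
        matches += verdict == human_pass
    return matches / len(labelled)


def stub_judge(prompt: str) -> str:
    # Stand-in for a real model call (assumption): flags one risky word only.
    return "FAIL" if "guaranteed" in prompt else "PASS"


HUMAN_LABELLED = [
    ("Can I get a refund?", "Refunds are available within 30 days.", True),
    ("Can I get a refund?", "A refund is guaranteed, no questions asked.", False),
    ("What are your hours?", "We are open 24/7.", False),  # human marked it wrong
]

agreement = judge_agreement(stub_judge, HUMAN_LABELLED)
print(f"judge agrees with humans on {agreement:.0%} of spot-checked cases")
```

A low agreement rate on the spot-check sample is a signal to tighten the rubric or change the judge before trusting its scores at scale.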
Why this is the difference between the 5% and the 95%
The reason evaluation goes hand in hand with pilots that survive isn't mystical. A team with evals can iterate with confidence: they change a prompt, the score moves, they keep what works. A team without them iterates on vibes, plateaus, and loses the stakeholders' patience before the feature is good enough to matter. Evals turn a fuzzy "the AI thing kind of works" into a number a sceptical executive can check, and a number trending in the right direction is what keeps a project funded. That, more than any model choice, is what separates the 95% of pilots that stall from the few that reach production.
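The "change a prompt, watch the score move" loop is easier to trust when the comparison is made per case rather than only on the headline number. A minimal sketch, assuming each eval run is stored as a mapping from a case id to pass or fail (the ids and results below are made up):

```python
def pass_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)


def diff_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> dict[str, list[str]]:
    """Compare two eval runs case by case, over the cases present in both."""
    shared = baseline.keys() & candidate.keys()
    return {
        "regressed": sorted(k for k in shared if baseline[k] and not candidate[k]),
        "fixed": sorted(k for k in shared if not baseline[k] and candidate[k]),
    }


# Hypothetical results before and after a prompt change.
baseline = {"case-01": True, "case-02": False, "case-03": True, "case-04": True}
candidate = {"case-01": True, "case-02": True, "case-03": False, "case-04": True}

changes = diff_runs(baseline, candidate)
print(f"baseline {pass_rate(baseline):.0%}, candidate {pass_rate(candidate):.0%}")
print("regressed:", changes["regressed"], "fixed:", changes["fixed"])
```

In this made-up example both runs score the same overall, yet one case was fixed and another broke, which is exactly the kind of quiet regression an aggregate score alone would hide.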
What this means for a team
Evaluation is the cheapest insurance you can buy against joining the 95%. It is what lets you change models without fear, lets you justify the AI feature to a sceptical stakeholder with data instead of anecdotes, and lets you tell the difference between "the model got better" and "we got lucky." Budget for it from day one rather than bolting it on after the pilot stalls. If you want help standing up evaluation for an AI feature, we do this.
Sources
- MIT NANDA — The GenAI Divide
- METR — Developer productivity RCT