It's become an article of faith that AI coding tools make developers dramatically faster. Then METR ran a proper randomised controlled trial — and the result surprised everyone, including the developers in it.
The study
METR's July 2025 paper, *Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity*, followed 16 experienced developers through 246 real tasks on large open-source repositories they maintain (averaging roughly 1M lines of code). Each task was randomly assigned to one of two conditions, AI allowed or AI not allowed, which is the same randomised design used in drug trials.
The design matters because it's the part most "AI made us X% faster" claims skip. These weren't toy problems or unfamiliar codebases; they were real issues in repositories the developers knew intimately. Randomly assigning AI or no-AI to each task removes the usual confound — that people reach for AI on the easy tasks and grind through the hard ones by hand — and lets you actually attribute the difference to the tool.
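To see why the random assignment matters, here is a minimal simulation sketch in Python. Everything in it is invented for illustration: the task difficulties are drawn from a uniform distribution, the 1.19 multiplier simply borrows the study's headline figure as an assumed true effect, and the function names are hypothetical. It is not METR's analysis.

```python
import random
from statistics import fmean

# Assumption for the simulation: AI makes every task take 19% longer.
TRUE_AI_MULTIPLIER = 1.19


def simulate(n_tasks, choose_ai, seed=0):
    """Return mean(AI task time) / mean(no-AI task time) for simulated tasks.

    choose_ai(difficulty, rng) decides whether a given task is done with AI.
    """
    rng = random.Random(seed)
    ai_times, no_ai_times = [], []
    for _ in range(n_tasks):
        difficulty = rng.uniform(1.0, 10.0)  # hypothetical baseline hours
        if choose_ai(difficulty, rng):
            ai_times.append(difficulty * TRUE_AI_MULTIPLIER)
        else:
            no_ai_times.append(difficulty)
    return fmean(ai_times) / fmean(no_ai_times)


def self_selected(difficulty, rng):
    # Developers reach for AI only on the easier half of the tasks.
    return difficulty < 5.5


def randomised(difficulty, rng):
    # A coin flip decides, regardless of how hard the task is.
    return rng.random() < 0.5


naive_ratio = simulate(5000, self_selected)
rct_ratio = simulate(5000, randomised)
print(f"self-selected: {naive_ratio:.2f}  randomised: {rct_ratio:.2f}")
```

In the self-selected run the ratio comes out well below 1, so AI looks like a large speed-up even though every AI task was slowed down by construction. The randomised run lands close to the assumed 1.19, because the coin flip spreads easy and hard tasks evenly across both arms.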
The result
- Allowing AI increased completion time by 19%.
- Developers believed AI had made them 20% faster: a 39-point gap between perception and measurement.
- Economists and ML experts had predicted a ~38–39% speed-up. Everyone was wrong in the same direction.
AI slowed people down while making them feel faster. Perception is not a measure of productivity.
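The arithmetic behind the gap is worth spelling out, because "19% longer" and "20% faster" are easy to conflate. A small Python sketch, treating both figures as percentage changes in completion time. The variable names are mine, and the throughput conversion is an added illustration rather than a number reported in the study.

```python
measured_time_change_pct = 19    # AI-allowed tasks took 19% longer
perceived_time_change_pct = -20  # developers believed tasks took 20% less time

# Gap between measured and perceived effect, in percentage points.
perception_gap_points = measured_time_change_pct - perceived_time_change_pct

# Taking 19% longer per task means completing 1 / 1.19 as many tasks per hour.
relative_throughput = 1 / (1 + measured_time_change_pct / 100)
throughput_drop_pct = round((1 - relative_throughput) * 100, 1)

print(perception_gap_points)  # 39
print(throughput_drop_pct)    # 16.0
```

Put as throughput, a 19% increase in time per task is roughly a 16% drop in tasks completed per hour, which is why the two framings should not be mixed when you report your own numbers.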
Why a slowdown, and why it felt fast
The result is less paradoxical than it sounds. On code you know deeply, the bottleneck isn't typing — it's understanding. The AI produces plausible code quickly, but then you have to read it, check it against context the model didn't have, and fix the parts that are subtly wrong. That review-and-repair loop can cost more than just writing the change yourself. Meanwhile it feels faster because the screen fills with code almost instantly; the effort moves from "producing" to "verifying," and verifying is quieter work that doesn't register as labour the same way. That's the perception gap in a sentence: output volume is not the same as progress.
There's a second mechanism worth naming. AI lowers the activation energy of starting, which feels great — a blank function gets a body in seconds. But on mature code the hard part was never starting; it was getting the last 20% exactly right against constraints the codebase already encodes. The model is fast at the easy part and unreliable at the hard part, so it front-loads the satisfying work and back-loads the expensive work. You end the task tired from reviewing rather than writing, and your memory of "how it went" is dominated by that brisk, productive-feeling start.
It's worth being precise about scope. This is one study, on experts working in code they know deeply — close to the worst case for AI assistance. It does not show AI never helps. For unfamiliar languages, boilerplate, exploratory prototyping or developers who are new to a codebase, the picture may look very different. What it punctures is the assumption that the speed-up is automatic, large and self-evident.
It also lands inside a broader pattern. A 2025 MIT report found that 95% of GenAI pilots deliver no measurable P&L impact, another case of confident expectation meeting unsentimental measurement. The through-line across both is that AI's felt value and its measured value can diverge sharply, and only one of them shows up in delivery dates or the income statement.
What this means for your team
- Don't trust the vibe. If you justify AI tooling on productivity, measure it — the people using it are not reliable narrators of their own speed.
- Run a cheap version of METR's design. Randomly assign similar tasks to AI or no-AI across a sprint or two, deciding before work starts, and compare real completion times, not survey sentiment.
- Target where AI plausibly helps. Onboarding to unfamiliar code, scaffolding, test generation — not deep edits to systems your seniors already hold in their heads.
- Separate satisfaction from throughput. Developers can genuinely enjoy the tool while it slows them down; both can be true, and only one shows up in delivery.
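A cheap version of that experiment can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not METR's estimator: the function names, ticket IDs and hours are all hypothetical placeholders, and comparing geometric means is a simplification I have chosen because completion times tend to be skewed by the occasional marathon task.

```python
import math
import random
from statistics import fmean


def assign_conditions(task_ids, seed=0):
    """Shuffle tasks and split them into two equal-sized arms."""
    rng = random.Random(seed)
    shuffled = list(task_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"ai": shuffled[:half], "no_ai": shuffled[half:]}


def time_ratio(ai_hours, no_ai_hours):
    """Geometric-mean completion time with AI divided by without AI.

    Values above 1 mean the AI arm was slower. Working on the log scale
    keeps one very long task from dominating the comparison.
    """
    log_ai = fmean([math.log(h) for h in ai_hours])
    log_no_ai = fmean([math.log(h) for h in no_ai_hours])
    return math.exp(log_ai - log_no_ai)


# Decide the arm BEFORE anyone starts work, so difficulty cannot drive the choice.
tasks = [f"TASK-{i}" for i in range(1, 13)]  # placeholder ticket IDs
arms = assign_conditions(tasks, seed=42)

# Placeholder hours, invented purely to show the calculation.
ai_hours = [3.0, 5.5, 2.0, 8.0, 4.5, 6.0]
no_ai_hours = [2.5, 5.0, 2.0, 6.5, 4.0, 5.0]
print(f"AI / no-AI time ratio: {time_ratio(ai_hours, no_ai_hours):.2f}")
```

With a dozen tasks the ratio will be noisy, so treat a sprint or two as a signal rather than a verdict. The part worth copying is the order of operations: assignment first, timing second, opinions last.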
The broader lesson generalises well beyond code: feelings about AI productivity are not evidence of it. Building a habit of measuring AI ROI against a real baseline is the only way to know which side of this study your team is on. If you want help setting that measurement up, our team does this kind of work.
A note on what this is not
It would be easy to weaponise this study into "AI tools are a waste of money," and that would be just as unevidenced as the hype it corrects. The honest reading is narrower and more useful: the productivity benefit is real in some contexts, absent or negative in others, and — critically — invisible to the person experiencing it. That means the worst possible way to decide whether to adopt AI tooling is to ask your developers how it feels. They will tell you it's faster, sincerely, and they may be wrong. The right way is to measure delivery against a baseline, accept that the answer might differ by task type and seniority, and let the data decide where the tool earns its place. Treating a single RCT as gospel is a mistake; treating developer enthusiasm as a metric is a bigger one.