One of the most discussed studies of 2025 came from METR, a research nonprofit that studies AI capabilities. Their paper, *Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity*, produced a counterintuitive result worth understanding carefully — because it's easy to weaponise in either direction.
(Our plain-language summary; the paper is linked so you can read the source.)
The backdrop
By early 2025, the prevailing story about AI coding assistants was one of obvious, large productivity gains. Survey after survey reported developers feeling dramatically faster, and vendor case studies quoted big percentage improvements. But almost all of that evidence was self-reported — people saying how much faster they felt — rather than measured. METR set out to do something the hype cycle had mostly skipped: actually time the work.
How the study worked
Crucially, this wasn't another survey. It was a randomised controlled trial, the same gold-standard design used to test medicines. Experienced open-source developers worked on real, substantial tasks in codebases they already knew deeply, often projects they personally maintained. For each task, AI assistance was randomly allowed or not allowed, and the actual time taken to complete the work was recorded. Randomisation is what makes the result credible: the study isn't comparing different people or cherry-picked tasks; it's comparing the same skilled developers doing comparable work with and without the tools.
Why does that design matter so much? Because the usual way these claims get made is prone to bias. If you ask people whether a tool helped, they remember the moments it dazzled them and forget the quiet minutes spent untangling its mistakes. A controlled trial sidesteps memory and impression: it just measures the clock. That's also what makes the result hard to wave away. The "but they weren't using it properly" objection carries less weight when the participants were skilled developers using then-current tools on their own code.
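To make the per-task randomisation concrete, here is a minimal Python sketch of the idea. It is not METR's actual procedure: the function name `assign_conditions`, the condition labels, and the task names are all hypothetical, chosen only for illustration.

```python
import random

def assign_conditions(task_ids, seed=None):
    """Randomly mark each task as AI-allowed or AI-disallowed.

    Illustrative only: an independent coin flip per task, decided
    before anyone starts work on it.
    """
    rng = random.Random(seed)  # seeded so the assignment can be reproduced and audited
    return {task: rng.choice(["ai_allowed", "ai_disallowed"]) for task in task_ids}

# Hypothetical task names, not taken from the study.
tasks = ["fix-flaky-test", "add-cli-flag", "refactor-parser", "update-docs"]
assignment = assign_conditions(tasks, seed=2025)

for task, condition in assignment.items():
    print(f"{task}: {condition}")
```

The point of the sketch is that the coin decides, not the developer. If people pick which tasks get AI help, they will tend to pick the ones where they expect it to shine, and the comparison stops being fair. With only a handful of tasks, independent coin flips can leave the two groups lopsided, so a real trial would want more tasks than this.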
The surprising result
When allowed to use AI tools, developers took about 19% longer to finish their tasks. Read that twice — longer, not shorter. And here's the twist that gave the study its punch: those same developers believed the AI had made them roughly 20% faster. Even after the fact, having lived through both conditions, they misjudged the direction of the effect. That's a nearly 40-point gap between what people felt and what actually happened.
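The arithmetic behind that gap is simple enough to spell out. The sketch below assumes "20% faster" is read as "20% less time on the task", and uses a made-up 60-minute task as the baseline; neither detail comes from the paper's own tables.

```python
# Changes in completion time, in percentage points relative to working without AI.
actual_time_change_pct = 19      # measured: tasks took about 19% longer with AI
perceived_time_change_pct = -20  # reported: assumed here to mean 20% less time

# Distance between what happened and what developers believed happened.
gap_points = actual_time_change_pct - perceived_time_change_pct

# Hypothetical task that takes 60 minutes without AI.
baseline_minutes = 60
actual_minutes = baseline_minutes * (1 + actual_time_change_pct / 100)
perceived_minutes = baseline_minutes * (1 + perceived_time_change_pct / 100)

print(f"gap: {gap_points} percentage points")
print(f"actual: {actual_minutes:.1f} min, perceived: {perceived_minutes:.1f} min")
```

Under those assumptions, an hour-long task stretches to a little over 71 minutes with AI while feeling like one that took 48.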
Why the slowdown — and why it's genuinely nuanced
It would be a serious mistake to read this as "AI makes developers slower, full stop." The result is narrow and specific. It applies to experts working in code they know intimately — precisely the situation where a human already holds most of the context in their head, so AI suggestions add review and correction overhead instead of saving lookup time. The likely contributors:
- Time spent reading and verifying AI output before trusting it.
- Fixing suggestions that were confidently wrong or subtly off.
- Context-switching between writing code and steering the assistant.
- Over-reliance on prompting for things a fluent expert would simply type faster by hand.
Flip the conditions and the picture can reverse. For unfamiliar codebases, boilerplate, new languages, or less-experienced developers who lack that internal context, the same tools may well deliver real speed-ups. The study measured one demanding scenario well; it did not measure the others.
The headline isn't "AI is useless." It's "people are remarkably bad at judging their own productivity" — and that perception gap is dangerous precisely when you're deciding where to spend money.
What teams should take from it
- Measure, don't assume. Feeling faster is not the same as being faster — instrument real outcomes like cycle time, throughput and rework rates. This perception-versus-reality gap is the entire reason measuring AI ROI properly matters.
- Match the tool to the task. AI tends to help most where the human lacks context, and least where they already have the most. Roll it out where the gap is real.
- Beware vibes-based rollouts. Genuine enthusiasm from your team is wonderful, but it is not evidence. Run a small controlled comparison before you commit budget across the org.
There's also a leadership lesson tucked inside the perception gap. If your own engineers — the people closest to the work — can misjudge the effect of a tool by nearly forty points, then dashboards built on self-reported satisfaction or anecdotal "this saved me so much time" feedback are a shaky basis for a six- or seven-figure rollout decision. The fix isn't to distrust your team; it's to give them an honest measurement instead of asking them to estimate. Pick a couple of representative task types, run them with and without the tooling for a few weeks, and look at the clock and the rework rate rather than the mood in the room.
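As a rough sketch of what "look at the clock and the rework rate" could mean in practice, the Python below summarises a small with-and-without comparison. Everything in it is an assumption made for illustration: the `TaskResult` record, the `summarise` helper, the condition labels, and above all the four data points, which are invented and not drawn from any study.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskResult:
    condition: str   # "with_ai" or "without_ai"
    minutes: float   # wall-clock time to complete the task
    reworked: bool   # True if the change later needed rework

def summarise(results):
    """Mean completion time and rework rate per condition, plus the time ratio."""
    summary = {}
    for condition in ("with_ai", "without_ai"):
        group = [r for r in results if r.condition == condition]
        if not group:
            raise ValueError(f"no results recorded for {condition}")
        summary[condition] = {
            "tasks": len(group),
            "mean_minutes": mean(r.minutes for r in group),
            "rework_rate": sum(r.reworked for r in group) / len(group),
        }
    # A ratio above 1.0 means tasks took longer with AI than without.
    summary["time_ratio"] = (
        summary["with_ai"]["mean_minutes"] / summary["without_ai"]["mean_minutes"]
    )
    return summary

# Invented numbers, purely to show the shape of the output.
results = [
    TaskResult("with_ai", 100, True),
    TaskResult("with_ai", 140, False),
    TaskResult("without_ai", 90, False),
    TaskResult("without_ai", 110, False),
]
summary = summarise(results)
print(f"time ratio (with / without): {summary['time_ratio']:.2f}")
print(f"rework rate with AI: {summary['with_ai']['rework_rate']:.0%}")
```

A comparison of two means on a handful of tasks is only a starting point. With real data you would want many more tasks per condition and some estimate of uncertainty before treating a ratio like this as a finding.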
The deeper point is that this study is a model for how to evaluate any AI initiative: a quiet, measured trial beats a confident anecdote every time. If you'd like help designing an honest with-and-without comparison for your own team rather than guessing, we're happy to help set one up. And looking forward, expect the answer to keep shifting as tools and workflows mature, which is exactly why the habit of measuring, not assuming, is the durable takeaway.