Chain-of-thought: the paper that taught models to “show their work”

One of the most influential and least technical findings in modern AI is this: if you prompt a model to work through a problem step by step, it answers multi-step questions noticeably more accurately. That's the core of a 2022 Google paper, *Chain-of-Thought Prompting Elicits Reasoning in Large Language Models*.

(This is our explanation of the paper; the source is linked.)

The context

By 2022, large language models were impressive at producing fluent text but oddly unreliable at multi-step problems. A model that could write a passable essay would routinely fumble a two-step word problem — the kind a careful ten-year-old gets right. The standard assumption was that this was a capability ceiling: the model simply wasn't "smart" enough, and the only path forward was a bigger, more expensive model. The interesting thing about this paper is that it found a different lever entirely, and that lever was free.

What they tried

The researchers gave models multi-step problems (grade-school maths word problems, commonsense questions and symbolic reasoning tasks) in two ways. First, the normal way: question in, answer straight out. Then a second way: they prompted the model to write out its intermediate reasoning before committing to a final answer, exactly like a teacher insisting a student "show your work" on a maths test rather than just circling a number.

In the paper this meant including a handful of worked examples in the prompt, each walking through its steps, so the model imitates that pattern on the new question. The everyday analogy is doing sums in your head versus on paper: forced to do a long calculation purely mentally, most people slip; given a scratchpad, the same people get it right. The reasoning steps are that scratchpad.
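To make the two prompting styles concrete, here is a minimal sketch in Python. The exemplar question, its wording and the function names are our own illustrations, not taken from the paper, and the code only builds the prompt text; sending it to a model is left out.

```python
# Hypothetical exemplar: the wording is illustrative, not copied from the paper.
EXEMPLAR_QUESTION = (
    "A shelf holds 4 boxes of 6 pencils each. 5 pencils are taken away. "
    "How many pencils are left?"
)
EXEMPLAR_REASONING = "4 boxes of 6 pencils is 24 pencils. Taking away 5 leaves 19."
EXEMPLAR_ANSWER = "19"


def build_standard_prompt(question: str) -> str:
    """Few-shot prompt whose exemplar shows only the final answer."""
    return (
        f"Q: {EXEMPLAR_QUESTION}\n"
        f"A: The answer is {EXEMPLAR_ANSWER}.\n\n"
        f"Q: {question}\n"
        "A:"
    )


def build_cot_prompt(question: str) -> str:
    """Few-shot prompt whose exemplar walks through the steps before answering."""
    return (
        f"Q: {EXEMPLAR_QUESTION}\n"
        f"A: {EXEMPLAR_REASONING} The answer is {EXEMPLAR_ANSWER}.\n\n"
        f"Q: {question}\n"
        "A:"
    )


if __name__ == "__main__":
    new_question = "A cafe had 23 muffins, sold 17, then baked 12 more. How many now?"
    print(build_cot_prompt(new_question))
```

The only difference between the two prompts is the reasoning sentence inside the exemplar answer. The new question is identical in both, which is what makes the comparison fair.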

A worked example

Take: "A cafe had 23 muffins, sold 17, then baked 12 more — how many now?" Asked for a direct answer, a model might blurt out a wrong number. Prompted to reason step by step, it writes something like: "Start with 23. Sell 17, leaving 6. Bake 12 more, giving 18." Laying the arithmetic out in stages keeps each step small and checkable, and the final figure is far more likely to be right. The reasoning also gives a reviewer somewhere to look: if the answer is wrong, you can usually see which step went astray rather than just knowing the total is off.
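The "you can see which step went astray" point can be sketched in a few lines of Python. This is an illustration under our own assumptions: the chain is written out by hand as a list of steps, each with the value it claims, and the helper name `first_wrong_step` is hypothetical. Real model output would need parsing first.

```python
from typing import Callable, List, Optional, Tuple

# Each step: (description, operation on the running total, value the chain claims).
Step = Tuple[str, Callable[[int], int], int]


def first_wrong_step(start: int, steps: List[Step]) -> Optional[str]:
    """Return the description of the first step whose claimed value is wrong, else None."""
    total = start
    for description, op, claimed in steps:
        total = op(total)  # recompute the step independently
        if total != claimed:
            return description
    return None


# The chain from the muffin example.
good_chain: List[Step] = [
    ("Sell 17, leaving 6", lambda n: n - 17, 6),
    ("Bake 12 more, giving 18", lambda n: n + 12, 18),
]

# A deliberately faulty chain: the subtraction is off by one.
bad_chain: List[Step] = [
    ("Sell 17, leaving 7", lambda n: n - 17, 7),
    ("Bake 12 more, giving 19", lambda n: n + 12, 19),
]

print(first_wrong_step(23, good_chain))  # None
print(first_wrong_step(23, bad_chain))   # Sell 17, leaving 7
```

With only a final answer of 19 to go on, a reviewer knows the total is off. With the steps written out, the check points at the subtraction.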

What they found

The difference was large. On hard reasoning tasks, walking through the steps improved accuracy dramatically. There was also a striking pattern: the benefit grew with model size. Small models barely improved, or even got slightly worse, while large ones leapt forward. In the paper's experiments the gains showed up mainly in models of roughly 100 billion parameters or more. Reasoning-by-steps appeared to be a capability that only "switches on" once a model is big enough. And crucially, none of this required retraining. The ability was already latent inside the model; the right prompt simply unlocked it.

Why it matters

It's near-free capability. A better prompt, rather than a bigger or fine-tuned model, often gets a markedly better answer at no extra training cost.
It made AI more auditable. When a model shows its steps, a human can inspect the reasoning and spot where it went off the rails — vital in regulated or high-stakes work.
It seeded today's "reasoning" models. Modern systems that visibly "think" before answering are direct descendants of this idea, now trained in rather than prompted on the fly.

The headline is almost philosophical: the model could already reason — it just needed to be asked to do it out loud.

A related and widely used trick built on top of this is to ask the model the same question several times, let it reason through each independently, and then take the answer it lands on most often. Because the reasoning paths vary slightly, the correct answer tends to recur while one-off mistakes don't — a bit like polling several people who each worked the problem alone and going with the majority. It's a simple, practical way to squeeze more reliability out of step-by-step prompting when accuracy really matters.
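Here is a minimal sketch of that majority-vote idea in Python. The model call is replaced by canned completions so the example runs on its own; the `sample` callable, the function names and the "last number in the text is the answer" rule are all simplifying assumptions of ours.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Treat the last integer in the completion as its answer (a simplifying assumption)."""
    numbers = re.findall(r"-?\d+", completion)
    return numbers[-1] if numbers else None


def majority_answer(sample: Callable[[], str], n_samples: int = 5) -> Optional[str]:
    """Draw n_samples completions and return the most frequent final answer."""
    votes: Counter = Counter()
    for _ in range(n_samples):
        answer = extract_final_answer(sample())
        if answer is not None:
            votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else None


# Canned completions standing in for independently sampled model outputs.
canned = iter([
    "Start with 23. Sell 17, leaving 6. Bake 12 more, giving 18.",
    "23 - 17 = 6. 6 + 12 = 18.",
    "23 - 17 = 7. 7 + 12 = 19.",
    "Sold 17 of 23, so 6 remain. Then 6 + 12 = 18.",
    "23 + 12 = 35. 35 - 17 = 18.",
])
print(majority_answer(lambda: next(canned)))  # 18
```

Four of the five canned chains end on 18 and one slips to 19, so the vote returns 18. In a real setup the samples only differ if the model is sampled with some randomness; five identical completions would add cost without adding information.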

The caveat

Here is the trap. A visible chain of reasoning looks convincing, but a tidy, confident-sounding explanation is not proof that the answer is right, and it isn't even guaranteed to be the real reason the model produced that answer. Models can reason their way to wrong conclusions, and they can write a plausible justification that has little to do with how they actually arrived at the output. So treat the steps as a useful aid for catching errors, not as a certificate of correctness. That gap between looking right and being right is one reason AI agents still fail in production without proper checks and verification around them.

The practical takeaway

For teams, the lesson is that prompt design is real engineering, not a gimmick — and that surfacing a model's reasoning is most valuable precisely when a human is positioned to check it. If you're building something where a wrong answer is costly, the chain of thought should feed a review step, not replace one. We're glad to help you design those guardrails. Looking forward, the field has largely absorbed this finding into the models themselves, but the underlying principle endures: give a model room to work through a problem and it will usually do better than when forced to answer in one breath.

Sources

Wei et al. (2022) — *Chain-of-Thought Prompting Elicits Reasoning in Large Language Models*

Written by Zain Ali
