Chinchilla and the scaling laws: why bigger models aren’t always better

For a few years the AI race looked like a simple contest: whoever trains the biggest model wins. In 2022, a DeepMind paper called *Training Compute-Optimal Large Language Models* — better known as the Chinchilla paper — showed that was the wrong race to be running. Here it is in plain terms.

(This is our explanation of the paper, with the original linked so you can check the source.)

The world before Chinchilla

By 2020–2021, the field had a strong working assumption: parameter count was the headline number. Each new model boasted a bigger figure than the last — billions, then hundreds of billions of parameters — and bigger generally did mean better, so the arms race made sense on the surface. The unspoken belief was that, given more computing budget, the smart move was to pour most of it into making the model larger. The amount of training data was treated almost as an afterthought.

The question the paper asked

Chinchilla reframed the problem as a budgeting question. Suppose you have a fixed budget of computing power — a fixed number of GPU-hours you can afford. What is the best way to spend it: build a bigger model, or train a smaller one on more data? These trade off against each other, because both cost compute, and you only have so much to go around.

A useful analogy: think of training a model like training an athlete on a fixed budget. You can spend it on raw size — more muscle, a bigger frame — or on practice hours. A huge athlete who has barely trained will lose to a moderately sized one who has trained intensively. The earlier era kept buying size and skimping on practice. Chinchilla's contribution was, in effect, to map out how to split the budget between size and practice to get the strongest competitor for your money — and to show that the prevailing split was badly off.

What they found

The researchers trained many models of different sizes on different amounts of data, then mapped the results. A clear pattern emerged: most state-of-the-art models of the day were too big and badly undertrained. For a given compute budget, you get a better model by making it somewhat smaller and feeding it far more data — and crucially, you should scale model size and training data together, roughly in proportion, rather than ballooning one and neglecting the other.

To prove it, they trained a model they named "Chinchilla." It was about 4× smaller than the era's flagship giant (DeepMind's own Gopher) but trained on a far larger amount of data using the same compute budget. The smaller, better-fed model won — beating the much larger one across a broad range of benchmarks, despite costing the same to train.
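
A quick back-of-the-envelope check makes the "same compute budget" claim concrete. The sketch below assumes the common approximation that training costs about 6 floating-point operations per parameter per training token, and plugs in the sizes reported in the paper (Gopher: 280 billion parameters, 300 billion tokens; Chinchilla: 70 billion parameters, 1.4 trillion tokens). The function name is ours, and the 6-FLOPs figure is a coarse estimate rather than an exact accounting.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


# Model sizes and token counts as reported in the Chinchilla paper.
gopher = training_flops(params=280e9, tokens=300e9)
chinchilla = training_flops(params=70e9, tokens=1.4e12)

print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")
print(f"Ratio:      {chinchilla / gopher:.2f}")
```

Under this estimate the two training runs land within about 20% of each other, even though one model is a quarter the size of the other. The smaller model simply spends its share of the budget on data instead of parameters.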

The practical formulation that came out of this work is easy to remember: for compute-optimal training, the number of training tokens should grow roughly in step with the number of parameters. The rule of thumb people took from it was roughly twenty training tokens for every parameter, far more data relative to size than the giants of the day had been fed. Whether or not you remember the exact ratio, the shape of the advice is what stuck: don't let the model outrun its data.
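
The rule of thumb can be turned into a small budget calculator. This is a minimal sketch under two assumptions that are approximations, not results taken verbatim from the paper: training compute is about 6 FLOPs per parameter per token, and the target ratio is 20 tokens per parameter. The function and constant names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0       # rule-of-thumb ratio (an approximation)
FLOPS_PER_PARAM_TOKEN = 6.0   # common training-cost approximation


def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Return (params, tokens) that spend the budget at ~20 tokens per parameter."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2  =>  N = sqrt(C / 120)
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


# Example budget: roughly the compute of a 70B-parameter, 1.4T-token run.
params, tokens = compute_optimal_split(5.88e23)
print(f"{params / 1e9:.0f}B parameters, {tokens / 1e12:.2f}T tokens")
```

Because the budget grows with the product of size and data, quadrupling the compute doubles both the parameter count and the token count in this sketch. That is the "scale them together" advice in numeric form.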

Why it mattered

Efficiency over raw size. It reframed the goal from "biggest" to "best-balanced." Capable models released since have generally been trained on far more data, relative to their size, than the pre-Chinchilla giants were.
It made smaller models viable. A well-trained smaller model can clearly beat a poorly balanced large one, which is part of why small, specialised models are now so competitive for real workloads.
It changed the economics. For the same training budget, a smaller compute-optimal model is cheaper to run, and running costs accrue every single day a product is live, so this is where the savings really land.

The lesson wasn't "models don't need to be big." It was "size without matching data is wasted money." Balance beats brute force.

Honest caveats

Chinchilla optimised for training compute — the cost of building the model once. But most of a deployed model's lifetime cost is inference: every query a user sends. If you expect to serve a model billions of times, it can actually be rational to "over-train" a smaller model well past the Chinchilla point, accepting a more expensive build in exchange for a permanently cheaper-to-run product — which is exactly what several later open models did. The paper also assumed plentiful high-quality training data, and the industry has since bumped into the limits of how much good text is available. So treat Chinchilla as a foundational correction, not a fixed recipe.
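
The training-versus-inference trade-off can be sketched with a little arithmetic. The example below assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per served token, both common approximations. The two model configurations are hypothetical, and the sketch assumes (without showing) that they reach similar quality; it only asks how much traffic it takes before the more expensive build pays for itself.

```python
def lifetime_flops(params: float, train_tokens: float, served_tokens: float) -> float:
    """Training (~6 FLOPs/param/token) plus inference (~2 FLOPs/param/token)."""
    return 6.0 * params * train_tokens + 2.0 * params * served_tokens


def break_even_served_tokens(big: tuple[float, float], small: tuple[float, float]) -> float:
    """Served tokens at which the smaller, over-trained model becomes cheaper overall."""
    (n_big, d_big), (n_small, d_small) = big, small
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    saving_per_served_token = 2.0 * (n_big - n_small)
    return extra_training / saving_per_served_token


# Hypothetical pair: (parameters, training tokens).
compute_optimal = (70e9, 1.4e12)
over_trained = (35e9, 4.0e12)

break_even = break_even_served_tokens(compute_optimal, over_trained)
print(f"Break-even after about {break_even:.1e} served tokens")
```

Past the break-even point, every additional query favours the smaller model, which is why a product expecting heavy traffic may deliberately train past the compute-optimal ratio.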

The business takeaway

When a vendor pitches you "the biggest model," the right question is rarely "how big?" It is "which model is the most appropriate for this job?" For the majority of real workloads, a smaller, well-matched model is faster, cheaper and entirely good enough, and it leaves budget for the parts that actually differentiate your product. That is precisely the build-vs-buy calculation worth running before you commit, and it's a conversation we're glad to have with you. Looking ahead, expect the frontier to keep shifting from "train the largest thing" toward "train the most efficient thing, then serve it cheaply at scale."

Sources

Hoffmann et al. (2022) — *Training Compute-Optimal Large Language Models*
Stanford HAI — AI Index 2025 (efficiency trends)

Written by Zain Ali
