As models got better at maths and logic, a fair question grew louder: are they actually reasoning, or just recognising problems they've effectively seen before? A 2024 paper from Apple researchers, *GSM-Symbolic*, ran a clever test to find out.
(Our plain-language summary of the study; the paper is listed under Sources below.)
Why the question needed asking
Part of the problem is how progress gets measured. For years, the headline evidence that models could "do maths" was their score on a benchmark called GSM8K — a fixed set of grade-school word problems. But there's a catch with any fixed test: the questions, and answers very like them, may well have appeared somewhere in the model's training data. A high score could mean genuine reasoning, or it could mean the model has effectively seen the answer key. From the outside, those two look identical. The Apple team set out to tell them apart.
The experiment
The trick was to stop using a fixed test and instead generate fresh variants of the same problems. They built a system that takes a grade-school maths problem and produces many versions of it by swapping in different names and numbers, while keeping the underlying structure and logic completely unchanged. "Tom has 5 apples" becomes "Sara has 8 apples" — same problem, different surface. A model that genuinely understands the maths should score the same across all variants, because the reasoning required is identical.
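To make the idea concrete, here is a minimal sketch of templated variant generation. It is not the paper's actual pipeline: the template text, the `NAMES` list, the number ranges and the `make_variant` helper are all our own illustrative assumptions. The point it shows is that the correct answer is computed from the sampled values, so every variant comes with its own ground truth.

```python
import random

# Hypothetical template: placeholders stand in for the surface details.
TEMPLATE = (
    "{name} has {start} apples and buys {bought} more. "
    "How many apples does {name} have now?"
)
# Hypothetical name pool; any list of names would do.
NAMES = ["Tom", "Sara", "Priya", "Luis"]


def make_variant(rng):
    """Return (question, answer) with fresh names and numbers, same logic."""
    name = rng.choice(NAMES)
    start = rng.randint(2, 20)
    bought = rng.randint(2, 20)
    question = TEMPLATE.format(name=name, start=start, bought=bought)
    answer = start + bought  # ground truth is computed, never looked up
    return question, answer


rng = random.Random(0)  # fixed seed so the variant set is reproducible
variants = [make_variant(rng) for _ in range(5)]
for question, answer in variants:
    print(question, "->", answer)
```

Because only the surface changes, any difference in a model's accuracy between two such variant sets cannot be explained by the difficulty of the underlying reasoning.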
They then went a step further and added a single sentence to each problem that looks relevant but has no bearing on the answer: a true-but-useless detail, the kind a person reads, recognises as a distraction, and simply ignores when doing the sum. For instance, in a problem about counting fruit, they might mention that some of the fruit was "a bit smaller than average". That detail changes nothing about the arithmetic, but it sits temptingly in the text. A human solver shrugs it off; the interesting question was whether the models could.
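A rough sketch of that distractor step, again under our own assumptions (the `add_distractor` helper and the example wording are hypothetical, not taken from the paper's code): the irrelevant sentence is slotted in just before the final question, and the correct answer stays exactly what it was.

```python
def add_distractor(question, distractor):
    """Insert an irrelevant sentence just before the final question sentence."""
    body, sep, ask = question.rpartition(". ")
    if not sep:
        # Single-sentence problem: put the distractor in front instead.
        return f"{distractor} {question}"
    return f"{body}. {distractor} {ask}"


base = (
    "Sara picks 8 kiwis on Friday and 12 on Saturday. "
    "How many kiwis does Sara have?"
)
answer = 8 + 12  # unchanged by the distractor

noisy = add_distractor(base, "Three of the kiwis were a bit smaller than average.")
print(noisy)
```

A model that subtracts the three "smaller" kiwis here has been pulled off course by exactly the kind of detail described above.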
What they found
- Just changing the names and numbers caused measurable accuracy drops across many leading models, and scores wobbled noticeably from one variant to the next. A genuine reasoner shouldn't care whether the apples belong to Tom or Sara.
- Adding one irrelevant sentence caused large accuracy drops, in the worst cases as much as 65%. The models were repeatedly pulled off course by information a child would dismiss out of hand, often trying to fold the useless number into the calculation.
- Performance grew less reliable as problems got more complex, with more steps to chain together — accuracy didn't just dip, it became harder to predict, which is its own kind of risk.
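One way to put a number on that "wobble" is to score the same solver on several variant sets and look at the spread rather than a single accuracy figure. The sketch below is ours, not the paper's evaluation code: `accuracy`, `spread` and the `toy_solver` (a lookup table standing in for a model that has only memorised one phrasing) are hypothetical names used for illustration.

```python
from statistics import mean, pstdev


def accuracy(solver, problems):
    """Fraction of (question, answer) pairs the solver gets exactly right."""
    correct = sum(1 for question, answer in problems if solver(question) == answer)
    return correct / len(problems)


def spread(solver, variant_sets):
    """Summarise how much accuracy moves from one variant set to the next."""
    scores = [accuracy(solver, problems) for problems in variant_sets]
    return {
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "stdev": pstdev(scores),
    }


# Toy stand-in for a model: it only "knows" one exact phrasing.
memorised = {"Tom has 5 apples and buys 3 more. How many apples now?": 8}


def toy_solver(question):
    return memorised.get(question)


variant_sets = [
    [("Tom has 5 apples and buys 3 more. How many apples now?", 8)],
    [("Sara has 8 apples and buys 4 more. How many apples now?", 12)],
]
report = spread(toy_solver, variant_sets)
print(report)
```

With a real model you would swap `toy_solver` for a function that calls the model and parses its final answer; a wide gap between `min` and `max` is the signal worth investigating.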
The interpretation the authors reach is sobering: a lot of what looks like reasoning is closer to very sophisticated pattern-matching against training data. The models have learned the shape of these problems extremely well, but matching a shape is more fragile than understanding the underlying logic. A useful way to picture it: a student who has genuinely understood long division can do it with any numbers you throw at them, while a student who has memorised the worked examples in the textbook does fine until you change the digits or slip in a distractor — at which point the cracks show. GSM-Symbolic was, in effect, a way to swap the digits and add the distractors at scale, and the cracks duly showed.
The models aren't "thinking" the way the polished demos suggest. They're extraordinary pattern machines, and patterns break in recognisable ways.
An honest caveat
It's worth holding this finding at the right altitude. The study focused on a specific class of grade-school maths problems, the field moves quickly, and newer models trained explicitly to reason have improved on exactly these kinds of stress tests. The paper is best read not as "models can never reason" but as "be careful about taking benchmark scores at face value, and expect brittleness at the edges." That caution remains sound regardless of which model you're using. There's also a healthy debate about where the line between "real reasoning" and "very good pattern-matching" even sits — for a lot of practical purposes the distinction matters less than the observable fact that small, irrelevant changes can swing the output, and that's the part you can plan around.
Why this matters for real products
This isn't a reason to avoid AI — it's a reason to design around its limits rather than assume they aren't there:
- Don't assume a confident, well-formatted answer is a correct one. Fluency is not accuracy.
- Test on your edge cases and reworded, real-world inputs — not just the tidy happy path the vendor demoed.
- Keep a human in the loop wherever an error is expensive or hard to reverse.
It's the same theme behind why so many agentic AI projects underdeliver: the capability is real, but reliability has to be deliberately engineered, not assumed into existence. The outlook is encouraging but conditional: models keep getting more robust, yet the discipline of testing on your own inputs rather than trusting a leaderboard never goes out of date. If you want help pressure-testing where a model is likely to break on your own data, that's exactly the kind of thing we do.
Sources
- Mirzadeh et al. / Apple (2024) — *GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in LLMs*