AI Strategy·June 4, 2026·6 min read

Inference got 280× cheaper in 18 months. Here’s what it unlocks

The collapsing cost of running models is the most underrated story in AI. It changes which products are viable — and which moats disappear.

Model capability gets the headlines. The quieter, more consequential trend is price. Per Stanford's 2025 AI Index, the cost of GPT-3.5-level inference fell from $20 to $0.07 per million tokens in 18 months — a 280-fold drop. Numbers that large are hard to feel, so put it in human terms: a task that cost a dollar now costs about a third of a cent. Things that are 280 times cheaper don't just get used more — they get used in ways nobody bothered to imagine when they were expensive.
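
The arithmetic is easy to check for yourself. Here is a minimal sketch in Python that uses only the two prices quoted above; the one-dollar task is the illustrative figure from the paragraph, not a measured workload. The raw ratio of the two prices comes out slightly above the headline 280, which is a rounded figure.

```python
# Prices per million tokens, as quoted above from the 2025 AI Index.
OLD_PRICE_PER_M = 20.00
NEW_PRICE_PER_M = 0.07


def price_ratio(old: float, new: float) -> float:
    """Return how many times cheaper the new price is than the old one."""
    return old / new


ratio = price_ratio(OLD_PRICE_PER_M, NEW_PRICE_PER_M)

# A task that used to cost one dollar (100 cents), repriced at the new rate.
dollar_task_now_cents = 100 / ratio

print(f"{ratio:.0f}x cheaper")
print(f"a $1 task now costs about {dollar_task_now_cents:.2f} cents")
```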

What cheap inference unlocks

  • New product shapes. Features that were uneconomical at $20/M tokens — summarising every document, classifying every ticket, drafting every reply — become trivial at $0.07.
  • Volume over cleverness. You can call a model many times (draft, critique, revise) where you once had to be sparing.
  • Smaller models, same job. A model matching 2022's flagship now runs with ~142× fewer parameters, pushing capability to the edge and on-device.

The deepest shift is in how you are allowed to design. When each call was costly, the dominant pattern was a single, carefully engineered prompt that had to get the answer right in one shot. When calls are nearly free, you can chain them: have the model draft, then critique its own draft, then revise — or generate three candidate answers and pick the best. Quality you used to chase through clever prompting you can now buy with a few extra cheap calls.
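
The two patterns described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: `Model` stands in for whatever text-in, text-out call your provider offers, the prompts are placeholders, and `echo_model` is a toy stub so the example runs without any API.

```python
from typing import Callable, List

# Hypothetical stand-in for any text-in, text-out model call.
Model = Callable[[str], str]


def draft_critique_revise(model: Model, task: str) -> str:
    """Chain three calls: draft, critique the draft, revise using the critique."""
    draft = model(f"Write a first draft.\nTask: {task}")
    critique = model(
        f"List concrete problems with this draft.\nTask: {task}\nDraft: {draft}"
    )
    return model(
        "Rewrite the draft, fixing the problems listed.\n"
        f"Task: {task}\nDraft: {draft}\nProblems: {critique}"
    )


def best_of_n(model: Model, task: str, n: int, score: Callable[[str], float]) -> str:
    """Generate n candidates and keep the one the scoring function rates highest."""
    candidates: List[str] = [model(f"Attempt {i + 1}.\nTask: {task}") for i in range(n)]
    return max(candidates, key=score)


def echo_model(prompt: str) -> str:
    """Toy model for demonstration only: returns the first line of the prompt."""
    return prompt.splitlines()[0]


# Count the calls to show what the multi-pass pattern costs: three calls, not one.
calls: List[str] = []


def counting_model(prompt: str) -> str:
    calls.append(prompt)
    return echo_model(prompt)


revised = draft_critique_revise(counting_model, "Reply to a refund request")
picked = best_of_n(echo_model, "Reply to a refund request", n=3, score=len)
print(len(calls), "calls for one revised answer")
```

In a real system the scoring function carries most of the weight: it might be a rubric-based model call, a validator, or a test suite. `len` is used here only to keep the sketch self-contained.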

When the cost of intelligence drops by two orders of magnitude, the constraint stops being "can we afford to call the model?" and becomes "have we designed the product to deserve it?"

A concrete example

Consider classifying every support ticket by topic, urgency and sentiment. At $20 per million tokens, doing that on every ticket in a high-volume queue was a real budget conversation, so teams sampled, or skipped it. At $0.07, you classify everything, in real time, and route automatically — and the feature that was once a "maybe next year" item becomes a Tuesday afternoon's work. The product didn't get more clever; the economics moved under it.
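
To put rough numbers on that, here is a back-of-the-envelope sketch. The ticket volume and tokens per ticket are assumptions chosen for illustration, not figures from any real queue; only the two prices come from the text above.

```python
def monthly_cost(tickets: int, tokens_per_ticket: int, price_per_m: float) -> float:
    """Monthly spend in dollars for classifying every ticket once."""
    total_tokens = tickets * tokens_per_ticket
    return total_tokens / 1_000_000 * price_per_m


# Illustrative assumptions: adjust to your own queue.
TICKETS_PER_MONTH = 100_000
TOKENS_PER_TICKET = 500  # prompt plus response, combined

for price in (20.00, 0.07):
    cost = monthly_cost(TICKETS_PER_MONTH, TOKENS_PER_TICKET, price)
    print(f"${price:.2f}/M tokens -> ${cost:,.2f} per month")
```

Under those assumptions the bill goes from about a thousand dollars a month to a few dollars, which is the difference between a budget line and a rounding error.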

Why this keeps happening

The 280-fold drop is not a one-off discount; it is the visible result of several compounding forces. Hardware gets faster and cheaper per unit of compute. Inference techniques — quantisation, distillation, better serving infrastructure — squeeze more throughput from the same chips. And fierce competition between model providers pushes prices toward cost. Most importantly, smaller models keep matching the quality that used to require large ones, so the same task migrates to a cheaper engine over time. None of those trends has obviously run out of room, which means the sensible planning assumption is that inference keeps getting cheaper, not that today's prices are the floor. Designing as if a capability will be cheaper next year than this year is usually the right bet.

What this means for your roadmap

The practical discipline this demands is to revisit your "too expensive" list on a schedule. Features you correctly ruled out eighteen months ago on cost grounds may now be comfortably affordable, and the team that re-evaluates them first gets to ship them first. Cost-driven "no" decisions have a short shelf life in this market — treat them as expiring, not permanent.
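
One way to make that review routine is to keep the shelved items as data and re-run the check whenever prices change. A minimal sketch, where the feature names, token estimates and budgets are all hypothetical:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShelvedFeature:
    name: str
    monthly_tokens: int    # estimated tokens the feature would consume per month
    monthly_budget: float  # what the team is willing to spend on it, in dollars


def affordable_now(features: List[ShelvedFeature], price_per_m: float) -> List[str]:
    """Return the names of features whose estimated monthly cost fits their budget."""
    return [
        f.name
        for f in features
        if f.monthly_tokens / 1_000_000 * price_per_m <= f.monthly_budget
    ]


# Hypothetical backlog of ideas that were ruled out on cost.
backlog = [
    ShelvedFeature("summarise every document", 2_000_000_000, 500.0),
    ShelvedFeature("classify every ticket", 50_000_000, 100.0),
]

print("at $20.00/M:", affordable_now(backlog, 20.00))
print("at $0.07/M:", affordable_now(backlog, 0.07))
```

With these made-up numbers neither feature clears its budget at the old price and both do at the new one, so the same backlog gives a different answer without anyone changing the idea.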

The flip side

Cheap inference also erodes moats. If a capability is one cheap API call away, it isn't a differentiator — your competitors have the same call. What remains defensible is what the model can't buy off the shelf: your proprietary data, the workflow you've wrapped around the model, the quality of your execution, and the trust of your users.

What this means for your strategy

  • Re-scope ideas you shelved on cost. A backlog item that was "too expensive to run at scale" 18 months ago may now be trivially viable.
  • Spend the savings on quality, not just volume. Use multi-pass patterns and evaluation harnesses to make output genuinely better.
  • Build the moat the model can't copy. Invest in data, integration and UX — the parts that survive when the underlying capability is commoditised.

The teams that win the next phase aren't the ones with access to the smartest model. Everyone has that. They're the ones who design products that earn the now-cheap intelligence they're built on. If you want to pressure-test an idea against these economics, let's talk.

Sources