Every AI feature has a cost that rarely appears in the plan: electricity. When a team scopes a pilot, the line items are usually engineering time, API fees and maybe some cloud storage. The energy a model burns every time it runs is invisible — right up until the feature succeeds, traffic grows, and the recurring bill becomes one of the largest numbers in the budget. The IEA's 2025 *Energy and AI* report puts numbers on the macro picture, and they're large.
The figures
- Global data-centre electricity demand is projected to more than double by 2030, to ~945 TWh, slightly more than Japan's entire electricity consumption today.
- Demand from AI-optimised data centres is projected to more than quadruple by 2030.
- AI accounted for an estimated 5–15% of data-centre power use in recent years; that share could reach 35–50% by 2030. US data-centre demand alone is projected to exceed 400 TWh by 2030.
These are not abstractions for someone else to worry about. They translate directly into rising compute prices, regional capacity constraints, and procurement teams that increasingly ask vendors hard questions about efficiency. The cost of intelligence is falling per token, but total consumption is climbing far faster — the classic rebound effect, where cheaper unit costs drive enough new usage that the aggregate bill goes up, not down.
AI's marginal cost feels like a per-token line item. At scale, it's a power-grid problem.
A concrete example
Picture a support tool that summarises every incoming ticket with a frontier model. At a few hundred tickets a day in a pilot, the energy and cost are a rounding error and nobody notices. Roll it out across a company handling a few hundred thousand tickets a day and you are now running millions of model calls a month, every one drawing power, every one metered. The same feature that was free to prototype is now a standing operational expense that competes with hiring. The decision that determines that bill (which model, how often, with how much context) was made early, almost by accident.
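To make the jump from pilot to production concrete, here is a minimal sketch of the arithmetic. Every number in it is a placeholder chosen for illustration: the token counts, the per-million-token prices, and the names `CallProfile` and `monthly_cost` are assumptions, not figures from any vendor or from the reports cited here. Substitute your own measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    """Hypothetical per-call profile; all values are illustrative placeholders."""
    input_tokens: int
    output_tokens: int
    usd_per_million_input: float
    usd_per_million_output: float

    def cost_per_call(self) -> float:
        # Cost is linear in tokens: input and output are priced separately.
        return (
            self.input_tokens * self.usd_per_million_input
            + self.output_tokens * self.usd_per_million_output
        ) / 1_000_000


def monthly_cost(profile: CallProfile, calls_per_day: int, days: int = 30) -> float:
    """Project spend for a month at a given daily call volume."""
    return profile.cost_per_call() * calls_per_day * days


# Assumed workload: one summary per ticket, ~2,000 tokens in, ~200 tokens out.
profile = CallProfile(
    input_tokens=2_000,
    output_tokens=200,
    usd_per_million_input=3.0,    # placeholder price
    usd_per_million_output=15.0,  # placeholder price
)

pilot = monthly_cost(profile, calls_per_day=300)
production = monthly_cost(profile, calls_per_day=300_000)

print(f"pilot:      ${pilot:,.0f} per month")
print(f"production: ${production:,.0f} per month")
```

With these placeholder inputs the per-call cost is identical in both rows; only the volume changes, and the monthly figure grows by the same factor of 1,000 as the ticket count.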
The reason this surprises people is that AI cost behaves unlike most software cost. Traditional features have a roughly fixed cost: you build them once and serving them is cheap. AI features have a cost that scales linearly with usage — every single request burns tokens, and tokens burn watts. Success makes the bill bigger, not smaller. A feature that's a hit is a feature that's expensive, which is the opposite of the usual software economics that teams' intuitions are built on.
The two efficiency levers
There are really only two ways to bring the bill down, and they map cleanly onto the two parts of an AI call. The first is doing less work per call: shorter prompts, retrieving only the context you actually need instead of stuffing the window, and trimming the output to what's required. The second is using a cheaper engine for the work: routing the easy majority of requests to a small model and reserving the expensive frontier model for the genuinely hard cases. Both reduce energy and cost in lockstep, and the best architectures use them together — small model, tight context, cached where possible.
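The two levers can be sketched together in a few lines. This is an illustration under stated assumptions, not a recommended router: the model names, the keyword list, the length threshold, and the helpers `trim_context`, `pick_model`, and `plan_call` are all hypothetical, and the "is this easy?" check is a crude stand-in for whatever classifier or confidence signal a real system would use.

```python
from dataclasses import dataclass, field

# Hypothetical model identifiers; replace with whatever your provider offers.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

# Hypothetical signals that a ticket needs the expensive model.
HARD_KEYWORDS = ("legal", "outage", "chargeback")


@dataclass
class Ticket:
    text: str
    # Retrieved context chunks, assumed to be sorted most relevant first.
    ranked_context: list[str] = field(default_factory=list)


def trim_context(ticket: Ticket, max_chunks: int = 2) -> list[str]:
    """Lever 1: do less work per call by sending only the top-ranked chunks."""
    return ticket.ranked_context[:max_chunks]


def pick_model(ticket: Ticket, max_easy_chars: int = 400) -> str:
    """Lever 2: use the cheap engine unless the ticket looks hard."""
    lowered = ticket.text.lower()
    is_long = len(ticket.text) > max_easy_chars
    has_hard_keyword = any(word in lowered for word in HARD_KEYWORDS)
    return FRONTIER_MODEL if (is_long or has_hard_keyword) else SMALL_MODEL


def plan_call(ticket: Ticket) -> dict:
    """Combine both levers into the parameters for a single model call."""
    return {
        "model": pick_model(ticket),
        "context": trim_context(ticket),
        "max_output_tokens": 150,  # cap the output to what a summary needs
    }


easy = Ticket("Where can I download my invoice?", ["billing FAQ", "invoice docs", "pricing page"])
hard = Ticket("We had an outage and need a root-cause report.")

print(plan_call(easy)["model"])  # small-model
print(plan_call(hard)["model"])  # frontier-model
```

The design point is that routing and trimming are decided before the call is made, so both can be tuned against measured quality without touching the rest of the feature.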
Why builders should care
- Inference is recurring; at volume, efficiency is a feature, not a sustainability footnote.
- Smaller models win: Stanford's 2025 AI Index notes that the smallest model scoring above 60% on MMLU shrank from 540 billion parameters (PaLM, 2022) to 3.8 billion (Phi-3-mini, 2024), a ~142× reduction. Read more on why smaller is often smarter.
- Architecture decides the bill — caching, retrieval, and routing easy work to cheap models cut energy and cost together.
What this means for an engineering team
Treat energy and cost as the same design constraint, because they are. Before scaling an AI feature, model the per-call cost at realistic production volume, not pilot volume — the gap between the two is where budgets break. Then pull the levers that reduce both: cache responses for repeated queries, retrieve only the context you need rather than stuffing the prompt, and route the easy majority of requests to a small cheap model while reserving the frontier model for the hard minority. The cheapest, greenest call is the one you never make because a cached answer or a deterministic rule handled it. Efficiency designed in early is nearly free; efficiency retrofitted after launch is a rewrite. If you want help sizing this for a real workload, get in touch.
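The "call you never make" can be sketched as a thin wrapper that tries a deterministic rule first, then an exact-match cache, and only then the model. Everything here is hypothetical: the `CachedSummariser` class, the `rule_based_answer` helper, and the sample rule are illustrations, and `call_model` stands in for whatever client your provider supplies. An exact-match cache only pays off when identical requests genuinely recur; semantic caching is a separate technique not shown here.

```python
import hashlib
from typing import Callable, Optional


def normalise(text: str) -> str:
    """Collapse case and whitespace so trivially different inputs share a key."""
    return " ".join(text.lower().split())


def rule_based_answer(text: str) -> Optional[str]:
    """Hypothetical deterministic rule: some tickets need no model at all."""
    if normalise(text) in {"unsubscribe", "stop"}:
        return "Unsubscribe request; no summary needed."
    return None


class CachedSummariser:
    def __init__(self, call_model: Callable[[str], str]) -> None:
        self._call_model = call_model      # injected so it can be faked in tests
        self._cache: dict[str, str] = {}
        self.model_calls = 0               # how many requests reached the model

    def summarise(self, text: str) -> str:
        ruled = rule_based_answer(text)
        if ruled is not None:
            return ruled                   # handled by a rule: zero model cost
        key = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.model_calls += 1          # cache miss: pay for one call
            self._cache[key] = self._call_model(text)
        return self._cache[key]


# Stand-in for a real model client, used only to demonstrate the flow.
def fake_model(text: str) -> str:
    return f"summary of {len(text)} characters"


summariser = CachedSummariser(fake_model)
summariser.summarise("Printer on floor 3 is   broken")
summariser.summarise("printer on floor 3 is broken")  # same key after normalising
summariser.summarise("STOP")                          # answered by the rule
print(summariser.model_calls)  # 1
```

Tracking `model_calls` alongside total requests gives a direct measure of how much work the rule and the cache are absorbing.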
Sources
- IEA — Energy and AI
- Stanford HAI — 2025 AI Index