AI Inference Costs: The Wake-Up Call for 2026 and 2027

19 May 2026, 00:00

ai / budgets / inference / anthropic / github-copilot / cto / enterprise / costs

The era of fixed-fee AI spending just ended. If you’re a CTO or engineering leader and you haven’t noticed yet, you will very soon — probably around September 2026 when some budget alerts start firing.

I’ve been watching this play out for a while now. Ed Zitron wrote a great (and entertainingly profane) newsletter piece this week called “AI Is Too Expensive” that lays out the macro picture — the hyperscaler capex insanity, the lab economics that don’t pencil out, all of it. I’m not going to rehash all of that here.

What I am going to do is tell you what it means for your engineering budget right now.

Because here’s the thing: two very significant pricing shifts are happening simultaneously, and most engineering leaders I talk to are only vaguely aware of one of them.

The Two Shifts You Need to Know About

Shift One: Anthropic ended fixed enterprise pricing.

Starting in late 2025 (and formally landing in early 2026), Anthropic restructured enterprise contracts. Gone are the fixed-seat bundles. In their place: a low base seat fee, plus full token consumption billed at API rates. If your engineers are heavy users of Claude Enterprise (coding, reviews, agentic workflows), your spend is now variable and uncapped.

Shift Two: GitHub Copilot goes usage-based on June 1, 2026.

This is the one that’s going to surprise a lot of people. GitHub Copilot — which most organizations have been treating as a fixed $19 or $39/user/month line item — is switching to AI Credits on June 1. The seat price stays the same. But that price now only covers code completions. Everything else — Copilot Chat with premium models, code review, agentic features — consumes credits at token rates.

Hoo boy.

If you have 200 engineers on Copilot Enterprise at $39/user, you’ve been budgeting $7,800/month. That number is about to become a floor, not a ceiling.

Wait, There’s a Catch

GitHub is giving customers a buffer during the transition. June through August, Business customers get $30/user/month in extra credits, and Enterprise customers get $70/user/month extra. That sounds nice.

Here’s my concern: that buffer is going to mask the real cost for three months.

The engineers who use Copilot Chat constantly, who run code reviews through it, who are experimenting with agentic features — they’ll burn through $39 of credits pretty fast. But with the extra buffer, it won’t hurt yet. Then September hits, the buffer is gone, and the bill arrives.
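To make the masking effect concrete, here is a minimal sketch of the arithmetic for a single heavy user. The $100/month consumption figure is hypothetical, and the sketch assumes that included credits equal the $39 seat price and that overage is billed dollar for dollar; check your own contract before relying on either.

```python
def monthly_overage(usage_usd: float, included_usd: float, promo_usd: float = 0.0) -> float:
    """Billable overage for one developer in one month; never negative."""
    return max(0.0, usage_usd - included_usd - promo_usd)


# Hypothetical figures for one heavy Copilot Enterprise user.
USAGE_USD = 100.0     # assumed monthly credit consumption
INCLUDED_USD = 39.0   # assumed: included credits equal the seat price
PROMO_USD = 70.0      # transition buffer quoted above for Enterprise

during_buffer = monthly_overage(USAGE_USD, INCLUDED_USD, PROMO_USD)
after_buffer = monthly_overage(USAGE_USD, INCLUDED_USD)

print(f"June-August overage: ${during_buffer:.2f}")
print(f"September overage:   ${after_buffer:.2f}")
```

Same developer, same behavior, and the overage goes from nothing to $61 a month the moment the buffer expires.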

If you don’t use June–August as a metering exercise — tracking actual consumption per developer per day — you’re going to be flying blind into your fall budget review.

This. This. This. Use the free credits to measure, not just to spend.
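What does metering per developer per day look like in practice? A rough sketch, assuming you can export usage as simple records. The record fields (developer, day, tool, cost_usd) and the sample values are made up for illustration; they are not the format of any GitHub or Anthropic export.

```python
from collections import defaultdict
from datetime import date


def daily_cost_per_developer(records):
    """Sum cost per (developer, day) across every tool."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["developer"], r["day"])] += r["cost_usd"]
    return dict(totals)


def heaviest_users(records, top_n=3):
    """Return the top_n developers by total cost, highest first."""
    per_dev = defaultdict(float)
    for r in records:
        per_dev[r["developer"]] += r["cost_usd"]
    return sorted(per_dev.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical usage records; the field names are assumptions.
SAMPLE = [
    {"developer": "alice", "day": date(2026, 6, 1), "tool": "copilot", "cost_usd": 2.50},
    {"developer": "alice", "day": date(2026, 6, 1), "tool": "claude", "cost_usd": 4.00},
    {"developer": "bob", "day": date(2026, 6, 1), "tool": "copilot", "cost_usd": 0.75},
    {"developer": "alice", "day": date(2026, 6, 2), "tool": "copilot", "cost_usd": 1.25},
]

for (dev, day), cost in sorted(daily_cost_per_developer(SAMPLE).items()):
    print(f"{day} {dev}: ${cost:.2f}")
```

The point of keying on developer and day rather than on tool is that the bill follows people and habits, not product names.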

What Is This Actually Costing Companies?

Let me give you some concrete data points (with appropriate caveats on sourcing).

Salesforce CEO Marc Benioff said on the All-In podcast earlier this month that Salesforce is on track to spend $300 million on Anthropic tokens in 2026. That’s a confirmed CEO statement, not an estimate.

Uber’s CTO disclosed the company burned through its entire 2026 AI budget in the first few months of the year, driven by Claude Code adoption across roughly 5,000 developers.

A Goldman Sachs research report found that AI inference costs at one unnamed software company were approaching 10% of total headcount costs — and trending toward headcount parity within a few quarters.

Read that last one again. Headcount parity. If you have a $10M engineering payroll, that’s a scenario where you’re spending $10M/year on AI inference on top of it. That math gets uncomfortable fast.

(To be clear: that Goldman figure describes one unnamed company. It’s a data point, not an industry average. But it’s not crazy, either, given the trajectory.)
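For a feel of what "a few quarters" implies, here is a back-of-the-envelope sketch. It assumes headcount cost stays flat and inference spend compounds at a constant quarterly rate. Both are simplifying assumptions of mine, and the growth rates are hypothetical, not figures from the Goldman report.

```python
import math


def quarters_to_parity(current_ratio: float, quarterly_growth: float) -> int:
    """Quarters until inference spend reaches headcount cost.

    current_ratio: inference spend divided by headcount cost today (0.10 = 10%).
    quarterly_growth: assumed compound growth per quarter (0.5 = 50%).
    Assumes headcount cost is flat over the period.
    """
    if not 0.0 < current_ratio < 1.0:
        raise ValueError("current_ratio must be between 0 and 1")
    if quarterly_growth <= 0.0:
        raise ValueError("quarterly_growth must be positive")
    return math.ceil(math.log(1.0 / current_ratio) / math.log(1.0 + quarterly_growth))


# Hypothetical growth scenarios, starting from the 10% figure above.
for growth in (0.25, 0.5, 1.0):
    quarters = quarters_to_parity(0.10, growth)
    print(f"{growth:.0%} growth per quarter: parity in {quarters} quarters")
```

Going from 10% to parity in about a year takes spend doubling every quarter. At 50% quarterly growth it takes six quarters.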

Why Costs Aren’t Coming Down Quickly

I’ve seen this argument: “Well, silicon is getting cheaper. This will sort itself out.”

Maybe. Eventually. But not before 2027 planning season.

Here’s the structural problem. The labs, Anthropic and OpenAI, are not making the margins on inference that they planned for. Anthropic’s gross margin came in at 40% in 2025, 10 percentage points below its own projections, with inference costs running 23% higher than anticipated (The Information, January 2026). OpenAI’s gross margin fell from 40% in 2024 to 33% in 2025, missing its own 46% forecast as inference costs grew fourfold year-over-year to roughly $8.4 billion.

Both labs are under enormous pressure to buy compute at whatever price it’s available because demand keeps outpacing their capacity planning. They’re not going to absorb those costs. They’re going to pass them on to customers through pricing pressure, contract restructuring (see: Anthropic’s enterprise billing change), and the elimination of cheap fallback model options (see: Copilot removing lower-cost fallbacks).

Anthropic is investing in custom silicon — they’re in talks with UK startup Fractile on inference chips, and they have a major Google TPUv7 Ironwood commitment coming — but commercial impact from those is a 2027-and-beyond story.

Plan conservatively. Adjust when you see evidence.

The “Free Rein” Problem

Here’s something I’ve noticed talking to engineering leaders: a lot of them gave their teams open access to AI tools without any governance structure. “Figure out how to use it, see what’s valuable.” Totally reasonable instinct in 2024.

In 2026, with consumption-based billing, that’s a budget grenade with the pin pulled.

Engineers will use the best available model for everything because it’s better. They won’t throttle themselves. Why would they? Setting the guardrails is your job as the leader. The difference between a frontier model and a mid-tier model for a routine task can be 5–20x in token cost. If you haven’t thought about model-tier policies, now is the time.
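To put that 5–20x spread into budget terms, here is a rough sketch. It assumes all current spend goes to a frontier model and that some share of it is routine work that could move to a mid-tier model. The $100,000 spend and the 50% routine share are hypothetical inputs, not measurements.

```python
def spend_after_routing(current_spend: float, routine_share: float, cost_ratio: float) -> float:
    """Estimated spend after moving routine work to a cheaper tier.

    current_spend: today's spend, assumed to be entirely on the frontier model.
    routine_share: fraction of that spend that is routine work (0.0 to 1.0).
    cost_ratio: how many times more the frontier model costs than the mid tier.
    """
    if not 0.0 <= routine_share <= 1.0:
        raise ValueError("routine_share must be between 0 and 1")
    if cost_ratio < 1.0:
        raise ValueError("cost_ratio must be at least 1")
    kept_on_frontier = current_spend * (1.0 - routine_share)
    moved_to_mid_tier = current_spend * routine_share / cost_ratio
    return kept_on_frontier + moved_to_mid_tier


# Hypothetical: $100,000/month, half of it routine, at both ends of the 5-20x range.
for ratio in (5.0, 20.0):
    estimate = spend_after_routing(100_000.0, 0.5, ratio)
    print(f"{ratio:.0f}x cheaper mid tier: ${estimate:,.0f}/month")
```

Notice that most of the saving shows up at 5x already. What matters more is how much of your traffic really is routine.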

There’s also a phenomenon playing out at some companies where engineers are burning tokens to hit adoption metrics rather than because it’s helping them. One of the data points Zitron cited (from internal Zillow reporting) had engineers burning tokens “just to hit internal AI adoption targets.” If you’ve tied any KPI to AI usage volume, you’ve created an incentive to waste money. Tie measurement to outcomes instead.

What To Actually Do Right Now

Here’s my practical take for the next 90 days:

Before June 1: Pull your Copilot usage reports. GitHub launched a billing preview tool in early May that shows projected credit consumption. Run it. See who the heavy users are and what they’re using credits for. That’s your baseline.

June–August: Use the buffer period to meter real usage, not just consume free credits. Track per-developer token consumption across all tools — Copilot, Claude, any direct API usage. That data is your 2027 budget anchor.

August/September: Build your 2027 AI budget with this structure:

Baseline: Seat costs (fixed)
Variable: Measured consumption × expected growth multiplier
Buffer: 2–3x your measured baseline — because adoption grows, models get more capable, and agentic workflows multiply token consumption in ways that are genuinely hard to predict
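Here is one way to turn that structure into numbers. This is my reading of it, not the only one: the buffer is treated as a planning ceiling on variable spend rather than as an extra line item. The seat count and price come from the Copilot Enterprise example earlier; the $10,000/month measured consumption and the 1.5x growth multiplier are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BudgetInputs:
    seats: int
    seat_price_monthly: float
    measured_variable_monthly: float  # from your June-August metering
    growth_multiplier: float          # expected growth in consumption
    buffer_multiplier: float          # 2.0 to 3.0 per the structure above


def annual_budget(b: BudgetInputs) -> dict:
    """Return baseline, expected, and ceiling annual figures in dollars."""
    baseline = b.seats * b.seat_price_monthly * 12
    expected_variable = b.measured_variable_monthly * b.growth_multiplier * 12
    ceiling_variable = b.measured_variable_monthly * b.buffer_multiplier * 12
    return {
        "baseline": baseline,
        "expected": baseline + expected_variable,
        # The ceiling never drops below the expected case.
        "ceiling": baseline + max(expected_variable, ceiling_variable),
    }


# Hypothetical inputs: 200 seats at $39, $10,000/month measured consumption.
plan = annual_budget(BudgetInputs(200, 39.0, 10_000.0, 1.5, 3.0))
for name, amount in plan.items():
    print(f"{name}: ${amount:,.0f}")
```

With these inputs the seat line is $93,600 a year, and it is the smallest number in the plan.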

Now, not later: Implement model-tier governance. Don’t let engineers default to the most expensive model for every task. Set policies. This is the highest-leverage cost control available to you.
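A model-tier policy doesn't have to be elaborate. Here is a minimal sketch of the idea: each task type gets a ceiling, and a request for a more expensive tier is capped at that ceiling. The tier names, task types, and mappings are all hypothetical placeholders, not the configuration of any real tool.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    MID = "mid"
    FRONTIER = "frontier"


# Cheapest to most expensive.
_ORDER = [Tier.SMALL, Tier.MID, Tier.FRONTIER]

# Hypothetical policy: the most expensive tier each task type may use.
POLICY = {
    "autocomplete": Tier.SMALL,
    "summarize": Tier.MID,
    "code_review": Tier.MID,
    "architecture": Tier.FRONTIER,
}
DEFAULT_CEILING = Tier.MID  # assumed default for task types not listed


def allowed_tier(task_type: str, requested: Tier) -> Tier:
    """Cap the requested tier at the policy ceiling for this task type."""
    ceiling = POLICY.get(task_type, DEFAULT_CEILING)
    return min(requested, ceiling, key=_ORDER.index)


print(allowed_tier("autocomplete", Tier.FRONTIER))  # capped to Tier.SMALL
print(allowed_tier("architecture", Tier.FRONTIER))  # allowed as requested
```

Capping rather than rejecting keeps engineers unblocked. They still get an answer, just from the tier the task warrants.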

The Uncomfortable Bottom Line

We’re at the point in the AI hype cycle where the bills are starting to arrive. The technology is genuinely useful (I use it every day, and I believe in it), but “useful” and “free” are two different things. We’re moving from a world where AI tools were subsidized loss leaders (Anthropic and OpenAI burning cash to grow usage) to a world where the costs are being passed on to enterprise customers.

I’ve been in this industry long enough to have watched this pattern play out with cloud computing, with SaaS, with mobile. The early days of “free” or cheap don’t last. What comes next is cost management, governance, and ROI accountability.

That’s not a bad thing. That’s how good technology earns its place in a budget.

One More Thing: The Case for Local Inference

Here’s something I haven’t seen enough people talking about in the context of costs: run it locally where you can.

Not everything needs a frontier cloud model. A surprising amount of real workload — code completion on well-defined tasks, document summarization, classification, structured extraction, in-product AI features — can run on smaller models on local hardware. And the hardware story has gotten genuinely interesting. We’re not just talking about beefy developer workstations. Edge devices with capable NPUs, small servers with consumer GPUs, even some of the newer SBCs are starting to handle quantized models at usable speeds.

I’ve been dorking around with this more and more lately, and I’m convinced that “local inference where possible, cloud for the heavy lifting” is going to be a real architecture pattern for cost-conscious teams in 2027. Not a replacement for the frontier models — for complex reasoning, long context, and the hard stuff, you still want the cloud. But for the long tail of routine tasks? The economics make a compelling case.
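The pattern is easier to picture as a routing decision. A minimal sketch, assuming requests arrive with a task type and a prompt size: routine tasks that fit the local model's context go local, everything else goes to the cloud. The task names and the 8,192-token limit are hypothetical; tune them to the hardware and model you actually run.

```python
# Hypothetical set of routine task types a small local model handles well.
LOCAL_TASKS = {"classification", "extraction", "summarization", "completion"}


def choose_backend(task_type: str, prompt_tokens: int, local_context_limit: int = 8192) -> str:
    """Return "local" for routine tasks that fit local context, else "cloud"."""
    if task_type in LOCAL_TASKS and prompt_tokens <= local_context_limit:
        return "local"
    return "cloud"


# Hypothetical requests: (task type, prompt size in tokens).
requests = [
    ("classification", 500),
    ("summarization", 50_000),
    ("complex_reasoning", 500),
]
for task, tokens in requests:
    print(f"{task} ({tokens} tokens) -> {choose_backend(task, tokens)}")
```

Every request that lands on "local" costs you electricity and hardware you already bought, not metered tokens.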

There’s also a privacy angle that some organizations are sleeping on. Tokens you never send to Anthropic or OpenAI are tokens you never pay for, and data you never expose. That matters in regulated industries.

I’m going to dig into this much more in future posts. The AI-at-the-edge story is where I think a lot of the genuinely exciting development is going to happen in 2027 — both on the hardware side and the model efficiency side. More on that soon.

Conclusion

The June 1 GitHub Copilot transition is not a small change. Combined with Anthropic’s enterprise billing restructuring, it marks the end of fixed-fee AI tooling for most engineering organizations.

Your 2026 H2 budget is probably wrong. Your 2027 budget will be wrong too if you don’t use the next three months to actually measure what things cost.

Use the buffer credits as a metering exercise. Build in a 2–3x buffer for 2027. Implement model-tier governance before the fall. And stop tying KPIs to usage volume — you’ll just be paying for busywork.

The technology is worth it. Just know what you’re paying for.

Sources: Ed Zitron, “Where’s Your Ed At” newsletter (May 19, 2026); GitHub Blog on Copilot usage-based billing (May 2026); The Information reporting on Anthropic and OpenAI margins (January–February 2026); Goldman Sachs research; Marc Benioff on All-In podcast (May 2026); The Register on Anthropic enterprise billing (April 2026).