June 13, 2026 · Edition #87

You Don't Need the Best Model.

You've been paying frontier prices to win cases your business doesn't have.

A quick note before today's letter.

After two years, How to Think with AI is live on Amazon this morning.

Everyone has the same AI tools. Some produce genius. Most produce gorgeous garbage. The difference is the human operating the machine.

This book is about that difference.

This is not a tool tutorial, nor a hype book. It’s a thinking book. For ambitious professionals who sense they're getting average work out of AI, and want to know why.

If you've felt the gap between your AI use and someone else's, this book closes it.

Other platforms are coming shortly. I'll share the full landing page with all the links as soon as it's ready. For now, if you're on Kindle, you can grab it here.

Thanks for being here while I built this. Now to today's letter.

Sofia forwarded me her cloud bill on a Tuesday. The subject line was: "we need to talk."

$47,000 for the month. Five engineers on the team. Most of it: API tokens.

She runs ops at a Series B startup. Last year, the same team's AI bill was $4,200 a month. Same people. Same product. Same outcomes. Eleven times the spend.

I asked her what changed.

"We switched to the latest model. We're using agents now. Everything is reasoning."

I told her she didn't have an AI problem. She had a Chaser problem.

There's a category of professional that has emerged since 2023. I've started calling it the AI Chaser.

You probably are one. So are most of your peers. So was I, until a few months ago.

The Chaser defaults to whatever model is topping the leaderboard. The Chaser pays the metered API rate on every prompt, every token, every chain-of-thought trace, no matter how trivial the task. The Chaser switches the moment a new benchmark king is crowned, gets briefly impressed, then forgets. The Chaser's bill is, by architectural design, uncapped.

The Owner thinks differently. The Owner asks one question:

What's the cheapest model that's good enough for this specific task, hosted in a way I control, with a fixed monthly ceiling I can defend in a board meeting?

This letter is how to stop being one and start being the other.

The Token Tax

Sofia's bill didn't go from $4K to $47K because her team did eleven times more work. They did slightly more work, on a more expensive model, in a more expensive way. Three things compounded into a tax most teams are paying without noticing.

1. Latest-model premium. Frontier models cost roughly 10–30x more per token than the model one generation back. The benchmark gains are real and modest. The pricing differential is real and not modest.

2. Agentic loops. Agents don't make one call. They make a chain. Sometimes hundreds. An agent that "reasons" before it acts spends most of its tokens on the reasoning step, not on the answer. A simple classification used to cost one API call. The agentic version costs eight to fifty.

3. Reasoning models. The current generation of "thinking" models bills you for the chain of thought, the internal monologue the model produces before it gives an answer. You don't see most of it. You pay for all of it. Reasoning trace tokens are billed at the same rate as output tokens. A simple question with a 50-token answer can have a 5,000-token reasoning trace behind it. You just paid for a hundred answers to get one.

Multiply these three together. A higher price per token, times more calls per task, times more tokens per call. That is how slightly more work turned into the bill that landed in Sofia's inbox.

This is the token tax. Most teams are paying it without realizing they could be paying a fraction.
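To see how the three factors compound, here is a rough sketch of the arithmetic for one task. The prices and token counts are illustrative assumptions, not anyone's real rate card. The only figures taken from the letter are the 50-token answer, the 5,000-token reasoning trace, the eight-call agentic loop, and the rule that reasoning tokens are billed at the output rate.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    calls_per_task: int      # API calls needed to finish one task
    input_tokens: int        # prompt tokens per call
    answer_tokens: int       # visible output tokens per call
    reasoning_tokens: int    # hidden chain-of-thought tokens per call
    input_price: float       # dollars per million input tokens (assumed)
    output_price: float      # dollars per million output tokens (assumed)


def cost_per_task(w: Workload) -> float:
    """Dollar cost of one task, billing reasoning tokens at the output rate."""
    billed_output = w.answer_tokens + w.reasoning_tokens
    per_call = (
        w.input_tokens * w.input_price + billed_output * w.output_price
    ) / 1_000_000
    return w.calls_per_task * per_call


# One call, previous-generation pricing, no reasoning trace.
baseline = Workload(1, 500, 50, 0, input_price=1.0, output_price=4.0)

# Eight calls, 10x pricing, a 5,000-token trace behind each 50-token answer.
chaser = Workload(8, 500, 50, 5000, input_price=10.0, output_price=40.0)

multiplier = cost_per_task(chaser) / cost_per_task(baseline)
print(f"baseline: ${cost_per_task(baseline):.4f} per task")
print(f"chaser:   ${cost_per_task(chaser):.4f} per task")
print(f"multiplier: {multiplier:.0f}x")
```

The per-task multiplier here is far larger than the jump in Sofia's total bill, because not every call in a real workload is agentic or reasoning-heavy. Swap in your own prices and token counts to see where your tax comes from.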

What I Actually Measured

OK. So if you switch to a cheaper or smaller model, you lose performance, right?

Less than you've been told.

Over the last six months, my team and I ran a side-by-side. The same workloads, through Claude Opus on one side and a self-hosted Qwen 3.5 on the other. Not synthetic benchmarks. Our actual client deliverables: classification, structured extraction, multi-step research, code review, document synthesis.

The performance gap on our real tasks: 5–10%.

The cost gap: 80–95% (sometimes 10x cheaper).

That ratio alone is the whole letter.

I want to be clear what this is and isn't. It's internal measurement on our specific task distribution. It's not a peer-reviewed paper. It's not a leaderboard. The 5–10% is task-dependent. On some tasks the open-weight model matched the frontier head-to-head. On a handful of harder tasks, the frontier was meaningfully better, and we still route those calls there.

But here's the thing about leaderboards. They measure the hard cases that distinguish the top model from the next tier. Your business does not run on the hard cases. Your business runs on the 80% of routine cases where the difference is invisible to the end user, the customer, and the auditor.

You've been paying a frontier premium to win the cases your business doesn't actually have.

That's the Chaser tax in one sentence.

The Architecture Move

The Owner makes one move and it changes everything downstream.

Pick a good-enough model. Host it on hardware you control. Cap your monthly cost at the hardware lease. Done.

This sounds like architecture talk. It’s really just money talk.

Right now, your AI spend is uncapped. It grows every time people use it more.

Flip it to a fixed setup, and the bill stops moving around.

Now you can scale usage without sweating every invoice.

Concretely. Sofia's $47K/month API bill, replaced with a self-hosted Qwen 3.5 setup on a dedicated GPU instance, lands around $6–8K/month for the same workload, billions of tokens included, no per-token panic. Saved: $39–41K/month. Annually: roughly half a million dollars. For a Series B startup, that's the difference between hiring two more engineers and not.
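The savings math above is simple enough to check in a few lines. This is a minimal sketch using the figures from the paragraph ($47K metered, $6–8K fixed). The function name is mine, and it ignores the engineering time needed to run the self-hosted setup.

```python
def savings_range(metered_bill: float, fixed_low: float, fixed_high: float):
    """Monthly savings (worst case, best case) when a metered bill becomes a fixed one."""
    worst = metered_bill - fixed_high   # the fixed setup costs the most
    best = metered_bill - fixed_low     # the fixed setup costs the least
    return worst, best


monthly = savings_range(47_000, fixed_low=6_000, fixed_high=8_000)
annual = tuple(12 * m for m in monthly)

print(f"monthly savings: ${monthly[0]:,.0f} to ${monthly[1]:,.0f}")
print(f"annual savings:  ${annual[0]:,.0f} to ${annual[1]:,.0f}")
```

The key property is not the exact number. The fixed side of the subtraction does not change when usage grows, while the metered side does.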

The market is already pivoting, whether you've noticed or not. Over 40% of enterprise AI workloads now include a local inference component. HuggingFace reports downloads of quantized model weights are up 320% year-over-year.

Gartner’s calling it: by 2027, organizations will use small, task-specific models three times more than general-purpose LLMs.

And honestly, from what I see… budgets are already reflecting that shift.

The Tradeoffs (Because There Are Some)

Three counter-facts you have to absorb, or this whole thing becomes the BS we exist to mock.

Frontier still wins the hard stuff. For genuinely complex multi-step reasoning, broad world knowledge, edge cases, and long multi-turn conversations, the frontier API is still the right call. The honest architecture is hybrid: 80–90% of your traffic to the self-hosted good-enough model, with the uncertain cases routed to a frontier API. (That routing layer is self-consistency, Letter 84. It's the mechanism that makes this work without flying blind.)

Even Apple is hedging. Apple has been building Neural Engine silicon for on-device AI since the A11 in 2017. Nine years in, it is the cleanest embodiment of the edge bet. And it recently licensed Google's Gemini to help power Siri's newer capabilities. The frontier still does things the edge can't do yet. If Apple is hybrid, you should be hybrid. The Owner move isn't "never touch a frontier API." It's "default to good-enough, escalate to frontier on demand."

Most edge AI initiatives are not shipping. A January 2026 Spectro Cloud survey found that only 11% of edge AI projects had reached full-scale production. The reasons were predictable: model-lifecycle management, governance, monitoring. Self-hosting is not free. You're trading a metered cost for operational complexity. If your team can't run the operational discipline, the Owner move backfires.

That’s why I keep saying this isn’t a “buy” decision. It’s an architecture decision.

You’re not buying a new model. You’re buying the annoying stuff around it: routing, governance, monitoring.

If you can’t run the boring parts, don’t pretend you’re going to self-host. You’ll just move the mess from the cloud to your own servers.
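For readers who want to picture the hybrid routing described above, here is a minimal sketch. It assumes one common reading of self-consistency: ask the local model the same question several times, keep the answer if the samples agree, and escalate to the frontier API if they don't. The function names, the stub models, and the 0.8 agreement threshold are all hypothetical, and this is not necessarily the exact mechanism from Letter 84.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Tuple

Model = Callable[[str], str]


def route(prompt: str, local_model: Model, frontier_model: Model,
          samples: int = 5, min_agreement: float = 0.8) -> Tuple[str, str]:
    """Return (answer, source). Escalate when local samples disagree."""
    # In practice the local model must sample with temperature > 0,
    # otherwise every sample is identical and agreement means nothing.
    answers = [local_model(prompt) for _ in range(samples)]
    top_answer, votes = Counter(answers).most_common(1)[0]
    if votes / samples >= min_agreement:
        return top_answer, "local"
    return frontier_model(prompt), "frontier"


# Stub models standing in for real inference calls (hypothetical).
def steady_local(prompt: str) -> str:
    return "invoice"


_flaky_answers = cycle(["invoice", "contract", "receipt"])


def flaky_local(prompt: str) -> str:
    return next(_flaky_answers)


def frontier(prompt: str) -> str:
    return "contract"


easy = route("classify this document", steady_local, frontier)
hard = route("classify this document", flaky_local, frontier)
print(easy)
print(hard)
```

Note the cost of this design: every request pays for several local samples. That only makes sense when local inference is a fixed cost, which is the whole point of the Owner move.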

Cloud Won Training. Edge Wins Inference.

Step back for one second, because this is the part that explains why the math is going to keep getting worse for Chasers.

The four big hyperscalers are projected to spend $650–725 billion on capex in 2026, roughly 75% of it AI-specific. Meta's stock fell over 9% in a single day when they raised their capex guidance. The whole industry is being financed with debt at historically unprecedented capital-intensity ratios.

That capex is going somewhere. It's going to training, to the GPT-class clusters that no individual or small team will ever match. Cloud won training. That's settled.

But training and inference are two completely different markets.

Inference is what touches the user. Inference is the email, the support response, the document summary, the code review. Inference is where the relationship lives. Inference is where the margin lives. Inference is where the lock-in lives.

And inference is moving to hardware people already own. The new Apple M5 Max ships with 128GB of unified memory at 614 GB/s and runs a 70B-parameter model locally with no compromises. The A19 chip in your phone hits 60–70 TOPS, roughly the territory of a 2017-class data-center GPU, sitting in your pocket. Mobile NPUs are catching up faster than anyone in the cloud business wants to admit.

The hyperscalers know this. That's why Meta's stock moved. The token-billing business model is mispriced for a future in which inference happens on devices that aren't theirs.

The companies positioned for that future aren't the API vendors. They're the silicon companies. Apple. NVIDIA (which wins both sides, in fairness). Qualcomm. The Chinese hardware sector building good-enough chips at a fraction of the cost.

You don't have to bet on this future to act on it. You just have to notice it's arriving in your AWS bill, one renewed contract at a time.

What to Do Monday Morning

If you're a CTO, founder, or team lead with a non-trivial AI bill, here's the audit. None of it requires new tools.

1. Pull your AI spend for the last three months. Per workload, not aggregate. You're looking for the 10–20% of workloads that consume 80% of the budget. That's the Pareto target.

2. Pick your top three workloads. Run a side-by-side. Take 50–100 real production inputs. Send them through your current frontier API and through a self-hosted "good enough" model (Qwen 3.5, Kimi, whatever fits your task class). Score the outputs against what you actually shipped. Compute the performance gap and the cost gap.

If your performance gap is in the 0–10% range and your cost gap is in the 80–95% range, you have a Chaser problem and a Monday-morning fix.

3. Migrate one workload. Cap the cost. Don't migrate everything at once. Pick the highest-volume, lowest-risk one. Set up the self-hosted model on a fixed-cost instance. Wrap a routing layer around it (Letter 84 again, the uncertain cases still go to the frontier API). Monitor for a month.

4. Bring the number to your CFO. Not as a savings story. As an architecture story. You converted an uncapped variable expense into a capped line item. That's the language that wins you next quarter's budget instead of next quarter's audit.

You don't need to bet the company on this. You need to prove it on one workload, then expand.
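Steps 1 and 2 of the audit can be sketched in a few lines. The spend figures and scores below are made-up placeholders and the function names are mine. The 80% budget share and the 10% and 80% thresholds come from the audit above. How you score outputs against what you shipped is up to you. The sketch assumes you end up with one number per model.

```python
from typing import Dict, List, Tuple


def pareto_targets(spend: Dict[str, float], share: float = 0.8) -> List[str]:
    """Smallest set of top-spending workloads that covers `share` of the budget."""
    total = sum(spend.values())
    running = 0.0
    targets = []
    for name, cost in sorted(spend.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        targets.append(name)
        running += cost
    return targets


def gaps(frontier_score: float, local_score: float,
         frontier_cost: float, local_cost: float) -> Tuple[float, float]:
    """Relative performance gap and cost gap, both as fractions of the frontier."""
    perf_gap = (frontier_score - local_score) / frontier_score
    cost_gap = (frontier_cost - local_cost) / frontier_cost
    return perf_gap, cost_gap


def is_chaser_problem(perf_gap: float, cost_gap: float) -> bool:
    # Thresholds from the audit: at most a 10% quality gap, at least an 80% cost gap.
    return perf_gap <= 0.10 and cost_gap >= 0.80


# Placeholder monthly spend per workload, in dollars.
spend = {
    "classification": 28_000,
    "extraction": 10_000,
    "research": 5_000,
    "code_review": 3_000,
    "misc": 1_000,
}
targets = pareto_targets(spend)

# Placeholder side-by-side result for one workload.
perf_gap, cost_gap = gaps(frontier_score=0.92, local_score=0.86,
                          frontier_cost=1_000, local_cost=120)
print(targets)
print(f"performance gap {perf_gap:.1%}, cost gap {cost_gap:.1%}")
print("Chaser problem:", is_chaser_problem(perf_gap, cost_gap))
```

Run the gap check per workload, not on the aggregate. One workload with a 30% performance gap can hide inside an average that looks fine.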

Back to Sofia

She ran the audit last month. She picked her highest-volume workload, document classification, about 60% of her token spend. She stood up Qwen 3.5 on a $1,500/month GPU instance. Wrapped a routing layer around it for the uncertain cases. Migrated.

Her bill this month: $9,400.

Performance complaints from her team: zero. Performance complaints from her customers: zero.

She still pays for Opus. She still uses it. For the cases that need it, which is now about 15% of her traffic, or roughly $3K of her current bill. The roughly $37K a month she no longer spends went into hiring an engineer.

You don't need the best model. You need the right model, on hardware you control, with a bill that doesn't surprise you.

AI is only as good as the human operating it. Let’s stop following benchmarks and new model releases blindly.

Have a great weekend.

Stay sharp.

— Charafeddine (CM)