
May 23, 2026 · Edition #84

Ask Your AI Five Times.

If you get five different answers, you almost shipped a guess.


Last month I sat in on an AI agent demo at a mid-sized SaaS company.

The agent was beautiful.

It read the support ticket. Searched the docs. Drafted a reply. Updated the CRM. Closed the loop. Total time: 11 seconds.

Everyone clapped (politely, because we were on Zoom).

Then someone asked the one question that breaks every AI demo I've ever watched:

"What happens when it's wrong?"

Silence.

The kind of silence where you can hear the founder's pricing model start to wobble.

Eventually somebody muttered:

"Well, we'll monitor it."

Translation: we don't have a system. We have a vibe with an API key.

That demo was three weeks ago. Thomas called me about a different version of the same problem last Wednesday.

 

Thomas runs AI ops at a regional bank.

"CM, I have 9 agents in production. Three are running on autopilot, and one of them quietly did something bad last week. The other six have humans reviewing every single output. My team is exhausted. My boss wants more agents. My compliance lead wants fewer. I don't know how to ship the next twelve."

Took me about thirty seconds to realize this wasn't a strategy question.

This was a physics question.

There are exactly two failure modes Thomas is choosing between:

  • Full autonomy. Agent runs alone. Looks fast in the demo, fails subtly in production, eventually kills a customer relationship or a regulator's patience. (See Letter 83. The graveyard.)

  • Full human-in-the-loop. Every output reviewed by a person. Looks safe on the slide deck. Scales to about three agents. Then the humans become the bottleneck and start rubber-stamping. (See Letter 77. Oversight cosplay.)

These are not "two valid approaches with tradeoffs." These are two ways to lose.

The honest answer most people don't want to hear is the one Thomas needed:

100% autonomous agents don't exist. 100% human-in-the-loop doesn't scale.

But there is a middle. The middle is where production AI actually lives. And almost nobody is talking about it because it doesn't fit on a keynote slide.

Let me show you.

 

The Two Lies the Industry Keeps Selling You

Lie #1: "Agents can run your entire business. Hire agents not humans."

The autonomy fantasy. The demo guys love it. The VCs fund it. The screenshots go viral on X.

It's a story for the 90 seconds the agent is on stage. Then it goes home. Real work is messy. A customer says one thing and means another. A legal clause looks standard until it isn't. A support ticket looks simple until the customer mentions, casually, at the bottom of paragraph six, that this issue is blocking a $200K renewal.

The agent does not see paragraph six. The agent files the ticket as "low priority, refund issued."

You find out in March, when the customer churns and the autopsy is your problem.

Lie #2: "Just keep a human in the loop. Review everything."

The safety theater answer. It sounds responsible. Mathematically, it's the same as not deploying AI at all.

If a human has to read every output, you didn't automate anything. You hired a very expensive copy editor and stapled a $200K inference bill to its forehead.

And worse, humans become terrible reviewers when you ask them to approve too much.

First 50 outputs, they pay attention.

Next 500, they skim.

Next 5,000, they're a button-clicking mammal.

Approve. Approve. Approve. Approve.

Then something blows up, and everyone pretends to be surprised. That's not oversight. That's a CAPTCHA with a salary.

So no, the answer is not full autonomy. And no, the answer is not full review.

The answer lives in the middle. And nobody is selling it to you because it doesn't fit on a keynote slide.

 

The Trust Problem Isn't What You Think

When we talk about whether an AI can be trusted, we usually mean it like a binary.

Either the model is good enough to act alone, or it isn't.

That framing is broken.

Trust isn't a property of the model. Trust is a property of this specific output, right now, in this specific context. Most outputs from a good model are reliable. Some outputs from the same good model are not. The trick is knowing which is which before it ships.

The current state of the art for "knowing which is which" at most companies is a human. A human reads the output. The human decides if it's trustworthy. The human's brain is the verification layer.

Then the human gets 200 outputs to review per day. And we are back to Letter 77.

Here's what most people miss: the model already has the information about whether it's confident or not. The model just isn't telling us.

We don't need smarter models. We need models that signal their own uncertainty.

That's where self-consistency comes in.

 

Ask the Model Five Times. Then Look.

Here's the whole move.

Instead of asking the model once, you ask it multiple times. Each time in a fresh conversation, so no answer can see the others.

Not because you enjoy burning tokens. Because disagreement contains information.

If you fire the same prompt at the model five times and get five answers that basically agree, the model is stable on this case. You can ship it.

If you fire it five times and get five wildly different answers, the model is guessing. Confidently, beautifully, with bullet points and an incredibly persuasive narrative, but guessing.

Don't ship the guess. Route it to a human.

That's it. That's the entire pattern.
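Here is a minimal sketch of that pattern in Python. It assumes a hypothetical `ask_model` callable that starts a fresh conversation on every call and returns a short, comparable answer (a label, not free text). The function names and the 0.8 threshold are illustrative only; this is not TrustGate's API.

```python
from collections import Counter
from typing import Callable, List, Tuple


def agreement(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of samples that gave it."""
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def route(prompt: str, ask_model: Callable[[str], str],
          n_samples: int = 5, threshold: float = 0.8) -> dict:
    """Sample the model n_samples times, then decide: ship or send to a human."""
    # ask_model is a stand-in for whatever client you use (assumption).
    answers = [ask_model(prompt) for _ in range(n_samples)]
    top_answer, score = agreement(answers)
    decision = "auto_ship" if score >= threshold else "human_review"
    return {"decision": decision, "answer": top_answer,
            "agreement": score, "samples": answers}
```

Keeping the raw samples in the result matters: when a case is routed, the reviewer sees the disagreement itself, not just a score.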

It's called self-consistency. I published a paper on it (arXiv 2602.21368) and shipped it as an open-source library (TrustGate) so you don't have to build it from scratch. It's not magic. It's not new math. It rests on conformal prediction, an idea that has been sitting in the statistics literature for decades. We just applied it to LLM outputs and gave it a friendly name.

It's somehow not in 95% of production AI deployments because everyone is too busy chasing the next model release to build a system around the one they already have.


The Shift You Need to Internalize

Read this part carefully, because it's the actual lesson and most people miss it.

In the autonomy fantasy, uncertainty is a hidden risk. The model doesn't know what it doesn't know. It produces a wrong answer with the same calm tone it produces a right one. You ship it. The boat hits the rocks.

In the self-consistency setup, uncertainty becomes a routing signal. You can see it. You can measure it. You can act on it.

You don't need the model to know what it doesn't know. You just need to ask five times and look at the spread.

Most teams treat uncertainty like an embarrassment. They lower the temperature. They fine-tune. They pray. They want the model to seem certain.

That's backwards.

You don't want the model to SEEM certain. You want to be able to TELL when it isn't.

Disagreement between samples is the most useful signal the model gives you. It's the model raising its hand and saying "I'm not sure about this one." Most teams have spent two years engineering that hand-raise out of their systems. Then they wonder why production keeps breaking.

 

One Concrete Example

You run customer support for a SaaS product. A user writes:

"Got charged twice after I upgraded. Also thinking about cancelling, this has happened before."

A fully autonomous agent apologizes, issues a refund, closes the ticket. Looks great in the demo. Completely misses that this is a high-value account with a churn signal and a billing bug.

A fully human-in-the-loop system sends this to a person, along with the other 9,999 tickets that week. The person stops reading carefully around ticket 200.

A self-consistency system asks the model five times to classify the case:

  • Sample 1: "High risk. Likely churn. Escalate."

  • Sample 2: "Medium risk. Process refund."

  • Sample 3: "High risk. Billing pattern detected. Escalate."

  • Sample 4: "Low risk. Apologize and refund."

  • Sample 5: "High risk. Repeated issue. Needs human."

The samples disagree. The system flags this one ticket, with the five different opinions attached, and routes it to a human who can actually judge it.

The other 9,999 routine tickets that week? Flow through automatically.

The human isn't drowning. The human is doing the actual job: handling the cases where judgment matters.
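To make the routing decision on this ticket concrete, here is a small sketch that scores the five samples above. It compares only the risk label (the first word of each sample), since the free-text reasons will always differ. The 0.8 threshold is an assumed value, not a recommendation.

```python
from collections import Counter

samples = [
    "High risk. Likely churn. Escalate.",
    "Medium risk. Process refund.",
    "High risk. Billing pattern detected. Escalate.",
    "Low risk. Apologize and refund.",
    "High risk. Repeated issue. Needs human.",
]

# Reduce each sample to its risk label so the answers are comparable.
labels = [s.split()[0].lower() for s in samples]
top_label, top_count = Counter(labels).most_common(1)[0]
agreement = top_count / len(labels)

THRESHOLD = 0.8  # assumed cut-off for shipping without review
decision = "auto_ship" if agreement >= THRESHOLD else "human_review"
print(top_label, agreement, decision)  # high 0.6 human_review
```

Three of five samples say "high", so agreement is 0.6. That is below the cut-off, and the ticket goes to a person.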

 

"But Isn't Five Calls Expensive?"

Yes. About 5x the inference cost on that call.

You know what's more expensive? One wrong refund processed automatically. One wrong contract clause autofilled. One wrong medical recommendation that the clinician trusted because the tone was confident.

The math is not close. 5x of pennies is still pennies. The mistakes are not pennies.

Run the actual numbers on Thomas's bank.

Before. Every output gets a 5-minute human review. 200 outputs a day means 1,000 minutes of senior reviewer time. Roughly 17 hours. Two full FTE-days. Cost: about €400 in salary, plus the opportunity cost of two of your best people not doing their actual jobs.

After. 80% of outputs are confident and ship automatically. The 20% that need review = 40 cases × 5 minutes = 200 minutes, about 3.3 hours. Add 5x API spend on the agent call: maybe €30 extra per day in inference.

Net: review salary drops from about €400 to about €80 a day. Subtract the €30 of extra inference and you keep roughly €290 per day. Per agent. Per workflow. Multiply across nine agents. Across a year.
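The same arithmetic, spelled out so you can plug in your own numbers. Every input comes from the figures above. The per-minute rate is derived from the €400 figure, on the assumption that it covers exactly the 1,000 review minutes. On salary alone the saving is €320 a day; once the extra inference spend is subtracted it comes to about €290.

```python
outputs_per_day = 200
review_minutes_each = 5
review_cost_before = 400.0    # euros per day, from the text
extra_inference_cost = 30.0   # euros per day, from the text
routed_share = 0.20           # share of outputs still sent to a human

minutes_before = outputs_per_day * review_minutes_each        # 1,000 minutes
cost_per_minute = review_cost_before / minutes_before         # derived rate (assumption)

minutes_after = outputs_per_day * routed_share * review_minutes_each
review_cost_after = minutes_after * cost_per_minute
salary_saving = review_cost_before - review_cost_after
net_saving = salary_saving - extra_inference_cost

print(f"review time: {minutes_before / 60:.1f}h -> {minutes_after / 60:.1f}h")
print(f"salary saving: {salary_saving:.0f} EUR, net of inference: {net_saving:.0f} EUR")
```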

Also: you don't run self-consistency on everything... of course. Summarize an email? One call. Draft a tweet? One call. Decide whether to refund a $200K enterprise account? Five calls. Ten. Whatever the stakes justify.

Self-consistency is a budget you spend where the rocks are.

 

Where Self-Consistency Breaks (And How to Fix It)

I'm not going to pretend this is a silver bullet. Three failure modes you should know about before you ship.

1. Confidently wrong.

A model can be uniformly wrong across all samples. 10 out of 10 agree, but they are all confidently incorrect. Self-consistency catches uncertainty. It does not catch systematic error.

The fix: pair self-consistency with a small set of deterministic checks. Did the cited document exist? Is the dollar amount inside the allowed range? Does the output match the required schema? Cheap, mechanical, complementary. Self-consistency tells you when the model is unsure. Deterministic checks tell you when the model is wrong in a way you can describe.
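A sketch of what those checks might look like for a refund agent. The output schema (`action`, `amount`, `cited_doc`), the document index and the amount limit are all hypothetical; swap in whatever your agent actually emits.

```python
from typing import List, Set

REQUIRED_FIELDS = {"action", "amount", "cited_doc"}  # hypothetical schema


def deterministic_checks(output: dict, known_docs: Set[str],
                         max_amount: float) -> List[str]:
    """Return the list of failed checks; an empty list means everything passed."""
    failures = []
    missing = REQUIRED_FIELDS - set(output)
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
        return failures  # the remaining checks need these fields
    if output["cited_doc"] not in known_docs:
        failures.append("cited document does not exist")
    amount = output["amount"]
    if not isinstance(amount, (int, float)) or not 0 <= amount <= max_amount:
        failures.append("amount outside allowed range")
    return failures
```

Run these after the agreement check, not instead of it. An output ships only if the samples agree and the list of failures is empty.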

2. Sample diversity.

If your sampling temperature is too low, every sample comes back identical and you get false 100% agreement. If it's too high, even good answers diverge and you get false uncertainty.

The fix: tune temperature per task. For factual lookup, lower. For creative generation, higher. Run a small calibration set of inputs with known correct answers, sweep the temperature, find the sweet spot. Afternoon of engineering work. Once.
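One possible way to run that sweep, as a sketch. It assumes a hypothetical `sample(prompt, temperature)` function and a calibration set of prompts with known answers. The score used here, mean agreement on correct cases minus mean agreement on incorrect cases, is just one reasonable choice of criterion, not the only one.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List, Tuple


def separation_by_temperature(
    sample: Callable[[str, float], str],   # hypothetical: (prompt, temperature) -> answer
    calibration: List[Tuple[str, str]],    # (prompt, known correct answer)
    temperatures: List[float],
    n_samples: int = 10,
) -> Dict[float, float]:
    """For each temperature, return mean agreement on correct cases minus
    mean agreement on incorrect cases. A bigger gap means agreement is a
    more useful signal at that temperature."""
    gaps = {}
    for t in temperatures:
        right, wrong = [], []
        for prompt, truth in calibration:
            answers = [sample(prompt, t) for _ in range(n_samples)]
            top, count = Counter(answers).most_common(1)[0]
            (right if top == truth else wrong).append(count / n_samples)
        # Without both groups there is nothing to separate, so score 0.
        gaps[t] = mean(right) - mean(wrong) if right and wrong else 0.0
    return gaps
```

Too low a temperature shows up as a gap near zero because wrong answers agree just as strongly as right ones. Too high shows up the same way because nothing agrees. Pick the temperature with the widest gap.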

3. Long-form drift.

For multi-paragraph outputs, two samples can agree on the first half and diverge on the second. A single global agreement score hides that.

The fix: chunk the output and score each chunk independently. Surface chunk-level disagreement to the reviewer, not just a global number. The reviewer should see which sentence the model wasn't sure about, not just "this output is suspicious."
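A rough sketch of chunk-level scoring. Two simplifying assumptions, both loud: chunks are sentences matched by position (first sentence against first sentence), and "agreement" is plain string similarity from the standard library's `difflib`, which is a crude stand-in for a semantic comparison.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter; a real system may chunk by paragraph or section."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_agreement(samples: List[str]) -> List[float]:
    """Mean pairwise similarity of the i-th sentence across all samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    chunked = [split_sentences(s) for s in samples]
    n_chunks = max(len(c) for c in chunked)
    scores = []
    for i in range(n_chunks):
        # A sample with no i-th sentence contributes an empty string.
        parts = [c[i] if i < len(c) else "" for c in chunked]
        sims = [SequenceMatcher(None, a, b).ratio()
                for a, b in combinations(parts, 2)]
        scores.append(sum(sims) / len(sims))
    return scores
```

The reviewer then gets one score per sentence, so the low-scoring sentence can be highlighted instead of flagging the whole output.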

These are real problems. They are also tractable. You don't need a PhD to handle them. (I have one, and I'm telling you that.)

 

What Changes in Practice

Once you have a routing layer, three things shift in your AI operation.

1. The "human in the loop" stops being a bottleneck and becomes a privilege. Your senior reviewers only see the uncertain 20%. They are not drowning. They are using their expertise. They are doing the job they were hired to do.

2. The model's reliability becomes measurable. Before: "I think it works pretty well." After: "Confidence above 0.8 on 84% of production calls last week. Routed 16% to humans. Of those, 12% needed correction. That's 1.9% of all calls caught and fixed before they shipped." Those are numbers you can put on a slide. Those are numbers you can sign off on with your auditor. Those are numbers you can defend in a quarterly business review.
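Those numbers fall out of a few lines over your call log. A sketch, assuming a hypothetical record format: one dict per production call with an `agreement` score and, for routed calls, a `corrected` flag set by the reviewer.

```python
from typing import Dict, List


def routing_report(calls: List[dict], threshold: float = 0.8) -> Dict[str, float]:
    """Summarise one period of production calls for a given routing threshold."""
    total = len(calls)
    routed = [c for c in calls if c["agreement"] < threshold]
    corrected = sum(1 for c in routed if c.get("corrected", False))
    return {
        "auto_ship_share": (total - len(routed)) / total,
        "routed_share": len(routed) / total,
        "correction_rate_of_routed": corrected / len(routed) if routed else 0.0,
        # Routed share times correction rate: calls a reviewer had to fix.
        "corrected_share_of_all": corrected / total,
    }
```

Note what this does not measure: errors in the auto-shipped calls. For that you still need to audit a random sample of what shipped.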

3. The deployment conversation changes. You stop arguing about whether to ship the agent. You start arguing about where to set the threshold. That's an engineering question with a clear answer, not a philosophy question with no answer.

This is the conversation Thomas's bank needs to be having. Instead, they're having a debate between two people who have read different LinkedIn posts.

 

Back to Thomas

I sent him a one-pager. Five sample queries. Three temperature settings. A starting threshold of 0.75. One instruction: do the math on one workflow, not all nine.

He texted me Friday.

"Ran it on the loan-document classifier. 87% of outputs auto-ship at threshold 0.8. The 13% that get routed to humans are exactly the cases my best analyst flagged manually last month. The system is doing the triage I was doing in my head, except faster and without forgetting."

That's the whole job.

You build a routing layer that surfaces what's worth a human's attention. You stop pretending the agent is autonomous when you don't even know which calls it should be making alone.

 

What to Do Monday Morning

Three things. None of them require buying new tools.

1. Pick one agent. Pick one workflow.

Not the whole stack. One. The agent where the cost of error is highest and the volume is large enough that engineering effort pays back. If you have nine agents, pick the most expensive failure case. Start there.

2. Run a self-consistency probe on 100 historical inputs.

Sample the model 10 times for each. Compute pairwise agreement. Plot the distribution.

You will see the shape of your problem immediately. There will be a big mass of high-agreement cases (your auto-ship zone) and a long tail of low-agreement cases (your route-to-human zone). The shape tells you everything you need to know about whether this workflow is ready for automation, ready for hybrid routing, or not ready at all.
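A sketch of the probe's scoring step, assuming you have already collected the 10 samples per input and that answers are short enough to compare by exact match after normalising. The function names are illustrative.

```python
from itertools import combinations
from typing import Dict, List


def pairwise_agreement(answers: List[str]) -> float:
    """Share of sample pairs that gave the same answer (exact match after normalising)."""
    norm = [a.strip().lower() for a in answers]
    pairs = list(combinations(norm, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


def text_histogram(scores: List[float], bins: int = 10) -> Dict[str, int]:
    """Count scores per equal-width bin on [0, 1]; a score of 1.0 lands in the last bin."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return {f"{i / bins:.1f}-{(i + 1) / bins:.1f}": n for i, n in enumerate(counts)}


def print_histogram(scores: List[float]) -> None:
    """Print one bar per bin so the distribution is visible without a plotting library."""
    for label, n in text_histogram(scores).items():
        print(f"{label} {'#' * n}")
```

Call `pairwise_agreement` once per historical input, collect the 100 scores, and pass them to `print_histogram`. The big bar at the top end is the auto-ship zone; everything spread out below it is the tail.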

3. Set a threshold. Defend it with data.

Tighter threshold (say, 0.9) means more goes to humans and fewer errors slip through. Looser threshold (say, 0.7) means less human work and more risk.

The right number is empirical. Run the calibration. Pick the threshold. Move on. You can adjust it later when you have production data.
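What "run the calibration" can look like in code. This sketch assumes each calibration input has been reduced to a pair: its agreement score and whether the majority answer was actually correct. It then shows, per candidate threshold, how much work goes to humans and how many errors would ship anyway.

```python
from typing import List, Tuple


def threshold_tradeoff(records: List[Tuple[float, bool]],
                       thresholds: List[float]) -> List[dict]:
    """records: (agreement score, was the majority answer correct?) per input."""
    rows = []
    n = len(records)
    for t in thresholds:
        # Everything at or above the threshold ships without review.
        shipped = [ok for score, ok in records if score >= t]
        slipped = sum(1 for ok in shipped if not ok)
        rows.append({
            "threshold": t,
            "routed_share": (n - len(shipped)) / n,
            "slipped_error_share": slipped / n,
        })
    return rows
```

Print the rows for 0.7, 0.8 and 0.9 side by side. The argument about where to set the threshold becomes an argument about which row you can live with.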

You don't need a six-month roadmap. You need an afternoon and a willingness to look at the data.

 

The Uncomfortable Truth

For two years, the AI industry has sold a false choice.

Choice A: Build fully autonomous agents and trust the hype.

Choice B: Keep a human in every loop and watch them drown.

Almost everyone working on production AI quietly knows neither one works. They just don't have a name for the third option. So they keep oscillating between the two, picking the wrong one for the wrong workflow, and ending up in next quarter's 74%.

The third option has been sitting in the statistics literature for decades. Conformal prediction. Self-consistency sampling. Disagreement as a signal. We didn't invent it. We just applied it to LLMs, named it, shipped the code, and wrote it down.

Uncertainty becomes a routing signal instead of a hidden risk.

That is how AI ships in production.

Everything else is a demo.

 

In case you’re still chasing models: the model isn't the moat. The model isn't even the question.

The question is whether you've built the layer that listens to the model when it doesn't know.

Most agents don't have one. That's why they die in six months.

If yours has one, it won't.

Have a great weekend.

— Charafeddine (CM)


Charafeddine Mouzouni — AI Scientist and Founder
