Skip to content

July 4, 2026 · Edition #90

Your AI is grading itself.

We don’t know how to test agents.


Quick heads up

My book is out.

"How to Think With AI" is now on Amazon and Apple iBooks. Paper version coming soon.

Not a tutorial (those go stale by the time they're printed). Not a hype book. 13 chapters on what actually separates the people getting extraordinary work out of AI from the people producing slop.

The subtitle is the whole pitch: Why the Same Tool Produces Genius for Some and Mediocrity for Others.

Get the book here. The French version is also available here.


In late June 1990, six weeks after Space Shuttle Discovery released it into orbit, the Hubble Space Telescope sent back its first images.

They were blurry.

Not slightly or fixable-in-software blurry. The most expensive scientific instrument in human history at the time, a $1.5 billion telescope built on three decades of anticipation, produced photographs that looked like a smudged pair of glasses. Trade press flipped overnight from "the eye of the universe" to "NASA's Hubble Trouble." Careers ended.

The cause was the primary mirror. Ground perfectly. Polished perfectly. Every quality-control measurement Perkin-Elmer recorded during two years of grinding confirmed the mirror was on-spec to within a fraction of a wavelength of light.

The measurements were wrong.

Perkin-Elmer had built the testing rig they used to certify the mirror. A specialized ruler. The rig had been assembled with a 1.3 millimeter error in one of its internal spacings. A washer in the wrong place.

For years, that flawed ruler was used to measure the mirror. Every reading confirmed the mirror matched what the ruler expected. The ruler expected the wrong curvature. The mirror was ground to be exactly that.

Two other test instruments, both cheaper, both external to Perkin-Elmer, had detected anomalies during earlier tests. Their results were dismissed as noise.

In December 1993, NASA sent seven astronauts up on Endeavour to install a corrective optics package. The servicing mission alone cost over six hundred million dollars. Perkin-Elmer stopped being NASA's optics contractor for a generation.

Every inspection had been conducted with an instrument Perkin-Elmer built. Every reading came back "exactly one instrument long."

Perkin-Elmer had not cheated. They had made an honest engineering mistake, then used their own instrument to check for it. Nobody meant to fake anything. The system failed anyway.

I have been thinking about that mirror all year.


Because we have built an entire industry where the instrument that writes the test has already read the code of the thing it is testing.


The problem you may not know exists

Every organization deploying AI is running into the same question. Is this thing safe to point at customers, at contracts, at money, at consequences?

For old-fashioned software, easy. You write a test. It passes or fails. Every behavior is decided by a human, one test at a time.

For classical machine learning, harder but tractable. Labeled dataset. Numbers come out. Everyone reads them the same way.

For large language models, the framework collapses.

There is no single correct reply to "help this customer with an expired warranty." Two competent humans would write different replies, both reasonable. The model itself changes weekly, in ways nobody can fully explain. Add an agent (a program that loops, calls tools, and takes real actions) and you are no longer grading a single answer. You are grading a process that unfolded across dozens of steps nobody watched.

So how do you know the thing is ready?

The industry has quietly converged on a menu of practices. Almost every option carries a hidden bill.

Some teams ask a coding agent to generate their test cases, using the same codebase the agent is running on. Some hire another LLM, often from the same model family as the one being tested, to grade the outputs. Some run their model against a public benchmark that may have leaked into the training data years earlier. Some skip evaluation and go by user complaints. Some update the eval set every time the model regresses.

None of these is cheating. Nobody in the industry set out to fake a measurement.

But every one of these approaches produces a confident number, and every one of these numbers has the same structural problem.

The instrument that produced the number is a little too close to the object it is trying to measure.

Which brings us back to the mirror.

The number that always agrees with you

An evaluation does not measure whether your system is good.

It measures the independence between how you made the test and the thing you are testing. That is all it measures.

When those two things are the same, the number is not a grade. It is a mirror. And a mirror always agrees with you.

If independence is zero, the number is not unreliable. It is structurally guaranteed to be flattering. A ruler designed with full knowledge of what it will be measuring will always read "exactly one ruler long."

Let me show you the shape of this with two people looking at the same agent.

Thomas versus Nadia

Thomas is Head of AI Ops at a European fintech. Good engineer, good taste, under pressure. He built a support agent for warranty questions. Monday morning, shipping week, he stood in front of his team with a number on the projector.

Ninety-six percent.

Two hundred test cases. Green almost all the way down. Someone had written "ship it?" in the team chat with a hopeful little question mark.

The problem was not the ninety-six percent. It was where the two hundred cases came from.

The Friday before, Thomas had opened Claude Code and typed: "Generate a comprehensive eval set for this agent." Claude Code had full read access to the agent's codebase: the prompts, the tools, the guardrails, the memory. It read all of that. Then it wrote two hundred test cases matched to what it had just read. Thomas ran the cases, another instance of Claude Code graded the results, and the number came back at ninety-six. He copied it to a slide.

Nobody cheated. The test suite was well-formed. The grading was consistent. And the score was structurally guaranteed to be flattering, because the tool that wrote the test had already read the code of the thing being tested. It generated questions the agent knew how to answer. Then the agent answered them. Then another LLM, drawing on the same worldview, called those answers correct.

The test was written by an AI that had read the code. The code was written by the agent's team. The grader was another AI from the same family. Nobody in that loop was outside the system.

Who grades the grader?

Three weeks after launch, the agent confidently told a customer that an expired warranty was still active, opened a replacement ticket, and triggered a shipment of a $2,400 part. The eval set had no case anywhere near that failure mode.

Of course it did not. You do not write questions when you already know which answers your student can produce.

Nadia does something that looks slower.

She pulls the cases from real history. Actual support tickets from last quarter, with the resolution the human agents actually reached. The agent never sees how those tickets were solved. The answer key is not the agent's opinion. It is what really happened, written by people who did not know an AI would later be graded against them. Where she can, ground truth is a fact, not a judgment: did the database change, did the warranty lookup return expired, did the shipment fire. Not "did the agent say it did."

Her number came back at seventy-one percent.

Thomas's ninety-six felt better in the meeting. Nadia's seventy-one is worth infinitely more, for one reason: hers can go down. A test the system can fail is the only kind of test that tells you anything.

Nadia brought her instrument from the outside. Thomas's instrument had spent Friday afternoon reading the source code.

The criminal version cost thirty billion dollars

In 2015, Volkswagen was caught programming eleven million diesel cars to recognize when they were being tested for emissions and switch to a cleaner engine mode for the duration of the test. On the road, the same car emitted up to forty times the legal limit.

VW's engineers taught the car to detect the test sequence. The moment the car detected it, a different software mode kicked in. The test-maker and the test-taker had become the same system, on purpose. The car was the ruler the car was being measured against.

VW paid over thirty billion dollars in fines and settlements. One of the most respected industrial brands in the world spent a decade rebuilding its trust.

The AI eval problem is the same structural shape without the criminal intent. Nobody at your company set out to fake a number. But when the instrument that writes the test has already read the code of the thing being tested, the number behaves the same way VW's did. The difference is that no prosecutor will show up. The dashboard just stays green until a customer notices.

Higher resolution. Lower cost per iteration. No indictments.

The uncomfortable truth

If your evaluation was designed by the team that shipped the model, tuned on data the team also chose, graded by a judge from the same model family, and stored in a dashboard whose numbers make everyone feel good, you are not measuring the model.

You are measuring how good the team is at feeling productive.

That is oversight cosplay.

Removing that conflict is not a technical problem. It is a governance one. Every mature discipline eventually figures out that the person doing the work does not get to be the sole judge of the work. External auditors for finance. Safety regulators for aircraft. Peer review for science. AI has not figured this out yet, because the numbers coming out of the current setup are too pleasant to interrupt.

A self-written eval is not fraud. It is worse.

It is a real number, generated by a real process, that measures nothing real. Nobody meant to lie. We borrowed software testing habits for a system that no longer behaves like software. And the instrument writing the test has already studied the code.

What to do Monday morning

Pick one. Not four. Run it for ninety days.

One. Separate the two hands. Whoever tunes the agent does not get to write the test it is graded on. If you can, put a different person on it entirely. The temptation to "fix the eval" when it is inconvenient is invisible and irresistible. The only defense is to put it physically out of reach.

Two. Get your cases from the world. Real logs. Real tickets. Real transcripts from after the model's training cutoff. If your eval set is generated by a coding agent with access to your codebase, you do not have an eval. You have autocomplete grading autocomplete.

Three. Make ground truth an event, not an opinion. The best ground truth is not a judgment. It is a fact. Did the database row change. Did the shipment fire. Where you need judgment, let one human be the standard, then check your automated judge against that human the way you would check any instrument.

Four. Keep a set you never look at. Tune against one pile. Report against another pile you have touched as little as possible. The gap between those two numbers is the only honest estimate of how much you have been fooling yourself.

Back to the mirror

The Allen Commission investigated Hubble and delivered its report in November 1990. The central finding has been repeated in engineering ethics classes ever since.

A measurement is worth exactly as much as the independence of the instrument that made it. When the instrument is manufactured by the same team that built the thing being measured, the measurement is a story, no matter how many decimal places it has.

Thomas reran his support agent, the slow way, against tickets it had never seen. The number came back at seventy-three percent. He stared at it for a long time. Then he said the truest thing anyone on his team had said all quarter.

"Okay. Now we know where we actually are."

That is the whole job. Not the high number. The true one.

We keep debating whether AI will replace the economy. We have not yet learned how to test it. Most of what we call "AI testing" is vibes: autocomplete grading autocomplete. Nobody is cheating. That is what makes it hard to see.

The fix is unglamorous. Build the test from the world. Keep one hand clean. Never let the tool that reads your code write the exam that judges it.

AI is only as good as the human operating it.

Have a great weekend.

Stay sharp.

— Charafeddine (CM)


↑ All editions Older →
Charafeddine Mouzouni — AI Scientist and Founder

Start with one email.