
May 10, 2025 · Edition #34

o1 vs o3 vs o4: I Tested Them So You Don’t Have To

The “Best AI Model” might be just the one you’ve already ignored. I put OpenAI’s o3, o4-mini, Claude 3.7, Gemini 2.5, and 4o head-to-head. One clear winner. And it’s not the one you think.


Why new models?

Have you ever asked this question?

You’ve been using your favorite AI for a while and then… AI providers just decide to flood the market with “new ones”.

I’ve been trying to write this letter for weeks now.

Every time I sat down to analyze the latest “genius” AI model, three more would drop before I could finish typing.

Just look at the past couple of weeks:

Monday: Claude-3.7 rewrites the game.

A few days later: Gemini 2.5 Pro is the new standard.

A few days later: OpenAI’s o3 and o4 are going to smash everyone else. (They didn’t.)

A few days later: Actually, the new standard is GPT-4.1. Not 4.5.

...

Each announcement bigger than the last. Each claim more outrageous than before. The tech news cycle turned into an endless stream of "forget everything you knew about AI" headlines.

So I waited.

Finally, a ceasefire. Maybe.

A rare week without a new launch.

Just long enough to write this before it gets outdated (maybe by the time you read this).

Let’s talk about what’s really happening.

⚡ The AI Arms Race Just Leveled Up

A couple of weeks ago, OpenAI released two new reasoning models, o4-mini and o3 (for paid subscriptions).

To be precise, these are not exactly plain LLMs. To me, they behave more like agents (they self-reflect, search the web, and so on) than "a model" per se.

People are already calling them “genius-level.”

(which isn't completely absurd, by the way)

But if that’s 100% true, we should be riding superconducting hoverboards by next week.

I would call them "powerful" or "potent," not really "genius"...

Because every few weeks, someone drops a "Genius Model" that is impressive at search, information retrieval, and "moderate problems," and great at being your "therapist," but that usually fails badly at more complex, multi-task problems or at completing end-to-end tasks with an "acceptable" level of "trust."

Still… OpenAI and other AI providers are shipping like mad.

Let’s zoom out for a second:

OpenAI

Date | Release | Source
February 27, 2025 | GPT-4.5 research preview | OpenAI
April 14, 2025 | GPT-4.1 series API-only models | OpenAI
April 16, 2025 | OpenAI o3 & o4-mini with full tool access | OpenAI
April 16, 2025 | OpenAI Codex CLI (command-line coding agent) | OpenAI

Anthropic

Date | Release | Source
February 24, 2025 | Claude 3.7 Sonnet hybrid reasoning model & Claude Code CLI preview | Anthropic
May 1, 2025 | “Claude can now connect to your world” Integrations beta | Anthropic
May 5, 2025 | Introducing Anthropic’s “AI for Science” program | Anthropic
May 7, 2025 | Web search on the Anthropic API (feature preview) | Anthropic

Google

Date | Release | Source
March 25, 2025 | Introducing Gemini 2.5 (Pro Experimental) thinking model | blog.google
April 4, 2025 | Start building with Gemini 2.5 Pro (public preview in AI Studio) | blog.google
April 8, 2025 | Deep Research capability on Gemini 2.5 Pro Experimental | blog.google
April 17, 2025 | Gemini 2.5 Flash preview (fast, cost-efficient thinking model) | blog.google

Confused by the naming? You’re not alone.

We’re talking about o4, not 4o.

Keep up. It’s a full-time job now.

🪓 Tools Everywhere, Focus Nowhere

Let’s take coding — one of the top AI use cases.

The dev tool space is insane right now.


Charafeddine Mouzouni — AI Scientist and Founder
