AI Benchmarks

Opus 4.6 and GPT-5.3 Codex Dropped on the Same Day. Here's How They Compare

Benchmark comparison chart showing Claude Opus 4.6, GPT-5.3 Codex, and Gemini 3 Pro scores across coding and reasoning benchmarks.

Anthropic released Claude Opus 4.6 and OpenAI released GPT-5.3 Codex on the same day. Neither model wins everything. GPT-5.3 Codex leads terminal coding (77.3% Terminal-Bench). Opus 4.6 leads SWE-bench, computer use, reasoning, and enterprise work. Gemini 3 Pro still owns math and multimodal. Pick your model based on your actual workload, not headlines.


February 5, 2026. Anthropic ships Claude Opus 4.6. OpenAI ships GPT-5.3 Codex. Same day. Two major frontier model releases, both aimed squarely at developers and coding workflows.

I spent the morning pulling benchmark data from both announcements, cross-referencing with Gemini 3 Pro (Google's most recent release from November), and trying to answer a simple question: which model should I actually be using?

The answer, as usual, is "it depends." But the specifics of how it depends are genuinely interesting this time around.

The Three Contenders

| | Claude Opus 4.6 (Anthropic) | GPT-5.3 Codex (OpenAI) | Gemini 3 Pro (Google DeepMind) |
| --- | --- | --- | --- |
| Released | Feb 5, 2026 | Feb 5, 2026 | Nov 18, 2025 |
| Pricing | $5 / $25 per M tokens | TBD | $2 / $12 per M tokens |
| Context window | 1M tokens (beta) | 400K tokens | 1M tokens |
| Max output | 128K tokens | 128K tokens | 64K tokens |
| Speed | Adaptive (4 effort levels) | 25% faster; <50% tokens | Deep Think mode |
| Key feature | Agent Teams | Self-developing | Full multimodal |
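The pricing row translates directly into per-request cost. Here's a quick sketch using the published list prices above (GPT-5.3 Codex is omitted since its pricing is TBD); the model-name strings are just dictionary keys for illustration, not real API identifiers:

```python
# USD per million tokens: (input rate, output rate), from the table above.
PRICING = {
    "claude-opus-4.6": (5.00, 25.00),
    "gemini-3-pro": (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the list-price cost of one request in USD."""
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Example: a 20K-token prompt with a 2K-token response.
cost_opus = request_cost("claude-opus-4.6", 20_000, 2_000)   # $0.15
cost_gemini = request_cost("gemini-3-pro", 20_000, 2_000)    # $0.064
```

At list prices, Gemini 3 Pro is less than half the cost of Opus 4.6 for the same token volume, which matters if benchmark gaps are small for your workload.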

Three very different philosophies. Anthropic is pushing on reasoning depth, enterprise reliability, and agentic capabilities. OpenAI is optimizing specifically for coding speed and token efficiency. Google is betting on multimodal breadth and raw math performance.

The benchmarks reflect those priorities clearly.

Coding and Agentic Benchmarks

This is what most of you care about. I'll cut straight to the numbers.


Terminal-Bench 2.0

Agentic coding via terminal, the gold standard for coding agents (higher is better):

| Model | Score |
| --- | --- |
| GPT-5.3 Codex | 77.3% |
| Opus 4.6 | 65.4% |
| GPT-5.2 Codex | 64.0% |
| GPT-5.2 | 62.2% |
| Opus 4.5 | 59.8% |
| Gemini 3 Pro | 54.2% |
| Gemini 2.5 Pro | 32.6% |

GPT-5.3 Codex is the clear leader for terminal-based agentic coding. The roughly 12-point gap over Opus 4.6's 65.4% is significant. If your workflow is CLI-heavy agent tasks, this is the number that matters most.

But Terminal-Bench isn't the only coding benchmark worth tracking.

SWE-bench Verified

Real-world GitHub issue resolution, the industry standard (Python-focused):

| Model | Score |
| --- | --- |
| Opus 4.5 | 80.9% |
| Opus 4.6 | 80.8% |
| GPT-5.2 | 80.0% |
| Gemini 3 Pro | 76.2% |
| Gemini 2.5 Pro | 59.6% |
| GPT-5.3 Codex | Not reported |

Why no GPT-5.3 Codex score? OpenAI shifted focus to SWE-Bench Pro for Codex models, calling SWE-bench Verified "Python-only" and positioning SWE-Bench Pro as "more contamination-resistant, challenging, diverse and industry-relevant."

Opus holds the SWE-bench crown. Interestingly, Opus 4.6 essentially matches 4.5 here (80.8% vs 80.9%) rather than surpassing it. The improvements in 4.6 show up elsewhere.

SWE-Bench Pro (Multi-Language)

Harder and contamination-resistant, spanning 41 repos across 100+ languages rather than just Python:

| Model | Score |
| --- | --- |
| GPT-5.3 Codex | 56.8% |
| GPT-5.2 Codex | 56.4% |
| GPT-5.2 | 55.6% |
| Gemini 3 Pro | 43.3% |
| Opus 4.6 | Not reported |
| Opus 4.5 | Not reported |

Why no Anthropic scores? Anthropic hasn't published SWE-Bench Pro results. SWE-Bench Pro is a newer benchmark introduced alongside GPT-5.2, and Anthropic focuses on Terminal-Bench 2.0 and SWE-bench Verified as its primary coding metrics.

OSWorld — Agentic Computer Use

Completing productivity tasks in a visual desktop environment (human baseline ~72%):

| Model | Score |
| --- | --- |
| Opus 4.6 | 72.7% |
| Opus 4.5 | 66.3% |
| GPT-5.3 Codex | 64.7% |
| GPT-5.2 Codex | 38.2% |
| GPT-5.2 | 37.9% |
| Gemini 3 Pro | Not reported |

Why no Gemini score? Google didn't report an OSWorld score. They reported 72.7% on ScreenSpot-Pro (a different screen-understanding benchmark) and focus on their own benchmarks for agentic tasks.

Opus 4.6 essentially matches human performance on OSWorld at 72.7%. That's a big deal for anyone thinking about AI-driven workflow automation. GPT-5.3 Codex made an enormous jump here too, going from GPT-5.2's 37.9% to 64.7%.

Reasoning and Knowledge

Coding isn't everything. If you're using these models for analysis, problem-solving, or research, the reasoning benchmarks tell a different story.

The biggest number here is Opus 4.6's ARC-AGI-2 score: 68.8%, up from 37.6% on Opus 4.5. That's an 83% improvement in novel visual reasoning on a single generation jump. Humanity's Last Exam (53.1% with tools) also puts Opus 4.6 at the top of multidisciplinary reasoning.

But Gemini 3 Pro dominates math. 95% on AIME 2025 without tools, and 23.4% on MathArena Apex where everyone else scores around 1%. If your work involves heavy mathematical computation, that gap is hard to ignore.

Enterprise and Knowledge Work

This is where Opus 4.6 pulls ahead most convincingly. GDPval-AA Elo of 1,606, which is 144 points above GPT-5.2. Finance Agent at 60.7%. BigLaw Bench at 90.2%. BrowseComp at 84.0%.

If your team is doing knowledge-intensive work (legal analysis, financial modeling, research), Anthropic has a clear edge right now. GPT-5.3 Codex does report 70.9% wins/ties on GDPval (matching GPT-5.2), but doesn't report Elo scores or most other enterprise benchmarks — which makes sense since it's specifically a coding-focused model.

The Complete Picture

Here's the full benchmark table across all categories. A lot of cells say "?" because each vendor reports different benchmarks. That's part of the story, too: it's getting harder to compare models directly when everyone is choosing different yardsticks.

| Benchmark | Opus 4.6 | Opus 4.5 | GPT-5.3 Codex | GPT-5.2 | Gemini 3 Pro | Gemini 2.5 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| **Coding & Software Engineering** | | | | | | |
| Terminal-Bench 2.0 (agentic terminal coding) | 65.4% | 59.8% | 77.3% | 62.2% | 54.2% | 32.6% |
| SWE-bench Verified (real-world GitHub issues) | 80.8% | 80.9% | ? | 80.0% | 76.2% | 59.6% |
| SWE-Bench Pro (41 repos, 100+ languages) | ? | ? | 56.8% | 55.6% | 43.3% | ? |
| OSWorld (agentic computer use) | 72.7% | 66.3% | 64.7% | 37.9% | ? | ? |
| Aider Polyglot (multi-language coding) | ? | 89.4% | ? | 88.0% | ? | ? |
| OpenRCA (root cause analysis) | 34.9% | 26.9% | ? | ? | ? | ? |
| **Reasoning & Knowledge** | | | | | | |
| Humanity's Last Exam (multidisciplinary reasoning) | 53.1%* | 43.2%* | ? | ? | 45.8%* | 21.6% |
| GPQA Diamond (PhD-level science Q&A) | 91.3% | 87.0% | ? | 93.2%¹ | 91.9% | 86.4% |
| ARC-AGI-2 (novel visual reasoning) | 68.8% | 37.6% | ? | 54.2%¹ | 45.1%² | 4.9% |
| AIME 2025, no tools (math competition) | ? | 87.0% | ? | 94.6% | 95.0% | 88.0% |
| MathArena Apex (hardest math problems) | ? | 1.6% | ? | 1.0% | 23.4% | 0.5% |
| SimpleQA Verified (factual accuracy) | ? | 29.3% | ? | 34.9% | 72.1% | 54.5% |
| **Enterprise & Knowledge Work** | | | | | | |
| GDPval-AA Elo (economic knowledge work) | 1,606 | 1,416 | 70.9%³ | 1,462 | 1,195 | ? |
| Finance Agent (financial analyst tasks) | 60.7% | 55.9% | ? | 56.6% | 44.1% | ? |
| BigLaw Bench (legal reasoning) | 90.2% | ? | ? | ? | ? | ? |
| **Agentic Tasks & Long Context** | | | | | | |
| BrowseComp (finding hard-to-locate info) | 84.0% | ? | ? | ? | ? | ? |
| Vending-Bench 2 (long-horizon agentic tasks) | $8,018 | $4,967 | ? | $1,473 | $5,478 | $574 |
| MRCR v2, 1M / 8-needle (long-context retrieval) | 76.0% | ? | ? | ? | ? | ? |
| **Multimodal & Vision** | | | | | | |
| MMMU-Pro (multimodal reasoning) | 77.3%† | ? | ? | 86.5% | 81.0% | 68.0% |
| Video-MMMU (video understanding) | ? | ? | ? | 90.5% | 87.6% | 83.6% |

Notes: * = with tools/search. † = with tools. ¹ = GPT-5.2 Pro variant (standard GPT-5.2 scores slightly lower). ² = Gemini 3 Pro Deep Think (base model scores 31.1%). ³ = GPT-5.3 Codex reports GDPval as win rate (70.9% wins/ties vs. professionals), not Elo, matching GPT-5.2's win rate. ? = not reported or not directly comparable. GPT-5.3 Codex is coding-focused and doesn't report separate general reasoning benchmarks.

Sources: Anthropic — Opus 4.6 announcement · OpenAI — GPT-5.3 Codex announcement · OpenAI — GPT-5.3 Codex system card · Google — Gemini 3 Pro announcement · Google DeepMind — Gemini 3 Pro model page · OpenAI — GPT-5.2 announcement · Scale AI — SWE-Bench Pro leaderboard · Terminal-Bench 2.0 leaderboard · ARC Prize — ARC-AGI-2 leaderboard · OSWorld benchmark

What Actually Matters Here

GPT-5.3 Codex dominates terminal coding

At 77.3% on Terminal-Bench 2.0, OpenAI's Codex model is 12 points ahead of Opus 4.6. If your workflow is CLI-heavy agent tasks, this is the model to watch. It's also 25% faster and uses less than half the tokens of its predecessor for equivalent tasks.

Opus 4.6 leads SWE-bench and computer use

Claude holds the crown on SWE-bench Verified (80.8%) and OSWorld (72.7%, matching the human baseline). For general-purpose code review, debugging, and operating computers autonomously, Opus 4.6 is the strongest option.

Opus 4.6 makes a huge reasoning leap

ARC-AGI-2 jumped from 37.6% to 68.8%, an 83% improvement in one generation. Humanity's Last Exam (53.1% with tools) leads all models. This is the biggest reasoning improvement in this cycle.

Gemini 3 Pro still owns math and multimodal

95% on AIME without tools. 23.4% on MathArena Apex versus ~1% for everyone else. State-of-the-art multimodal scores. For math-heavy or vision-heavy tasks, Gemini 3 Pro remains the clear leader.

Enterprise work: Opus 4.6 by a wide margin

GDPval-AA Elo of 1,606 is 144 points ahead of GPT-5.2. Finance Agent, BigLaw Bench, BrowseComp are all Opus wins. If your work is knowledge-intensive enterprise tasks, Anthropic has a clear edge.

No single winner. Pick for your use case.

Terminal coding: GPT-5.3 Codex. GitHub bugs: Opus 4.6. Math: Gemini 3 Pro. Computer use: Opus 4.6. Vision: Gemini 3 Pro. Enterprise docs: Opus 4.6. Speed: GPT-5.3 Codex. Each has real strengths.

How Should You Think About This?

If you're an engineering leader evaluating these models for your team, here's my take: stop looking for a single answer.

The era of "one model to rule them all" is over. Each of these models has genuine, measurable strengths in different domains. The right approach is to match the model to the task. Use GPT-5.3 Codex for your agentic terminal workflows. Use Opus 4.6 for code review, debugging, and knowledge work. Use Gemini 3 Pro when you need math or multimodal capabilities.
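That "match the model to the task" approach can be made concrete as a simple routing table. This is an illustrative sketch, not a real API; the task categories and model-name strings are mine, chosen to mirror the benchmark leaders above:

```python
# Route each task category to the current benchmark leader for that category.
ROUTES = {
    "terminal_agent": "gpt-5.3-codex",    # Terminal-Bench 2.0 leader
    "github_issues": "claude-opus-4.6",   # SWE-bench Verified leader
    "computer_use": "claude-opus-4.6",    # OSWorld leader
    "math": "gemini-3-pro",               # AIME / MathArena Apex leader
    "multimodal": "gemini-3-pro",         # MMMU / Video-MMMU leader
    "knowledge_work": "claude-opus-4.6",  # GDPval-AA / BigLaw Bench leader
}

def pick_model(task_type: str, default: str = "claude-opus-4.6") -> str:
    """Return the benchmark leader for a task category, or a fallback."""
    return ROUTES.get(task_type, default)
```

In practice this table lives wherever your team dispatches model calls, and it should be revisited every time a new release shuffles the leaderboard.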

If that sounds like more complexity to manage, it is. But it's also where the leverage is. The teams that figure out which model to use when will outperform the teams that pick one and stick with it.

One more thing worth noting: benchmarks are a starting point, not a final answer. I've seen models that score lower on benchmarks outperform on specific codebases because of how they handle context, follow instructions, or match a team's coding patterns. Always validate on your own workloads.
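Validating on your own workloads doesn't require heavy tooling. A bare-bones harness like the one below is enough to start; `run_model` is a stand-in for whatever client call you actually make, and the tasks and pass/fail checkers are yours to define:

```python
from typing import Callable

def evaluate(run_model: Callable[[str, str], str],
             models: list[str],
             tasks: list[tuple[str, Callable[[str], bool]]]) -> dict[str, float]:
    """Return each model's pass rate over (prompt, checker) task pairs."""
    results = {}
    for model in models:
        passed = sum(1 for prompt, check in tasks
                     if check(run_model(model, prompt)))
        results[model] = passed / len(tasks)
    return results
```

Even twenty representative tasks from your own backlog, scored this way, will tell you more about fit than any vendor leaderboard.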

These numbers will change again soon. That's how this works now.

Frequently Asked Questions

Which AI model is best for coding in February 2026?

It depends on what kind of coding. GPT-5.3 Codex leads Terminal-Bench 2.0 (77.3%) for agentic terminal-based coding. Claude Opus 4.6 leads SWE-bench Verified (80.8%) for real-world GitHub issue resolution and OSWorld (72.7%) for computer use. There is no single best coding model. Pick based on your workflow.

How does Claude Opus 4.6 compare to GPT-5.3 Codex?

GPT-5.3 Codex dominates agentic terminal coding (77.3% vs 65.4% on Terminal-Bench 2.0) and is 25% faster than its predecessor. Opus 4.6 leads on SWE-bench Verified (80.8%), OSWorld computer use (72.7%), reasoning benchmarks like ARC-AGI-2 (68.8%), and enterprise tasks (GDPval-AA Elo of 1,606). Each model has distinct strengths depending on the task.

Where does Gemini 3 Pro fit in the AI model landscape?

Gemini 3 Pro (released November 2025) leads in math (95% AIME without tools, 23.4% MathArena Apex) and multimodal tasks (87.6% Video-MMMU). It's the best choice for math-heavy, vision-heavy, or multimodal workloads, though it trails in coding-specific benchmarks.

Should I switch AI coding models based on these benchmarks?

Benchmarks are a starting point, not a final answer. Run your own evaluations on your actual codebase and workflows. The model that scores highest on a benchmark may not perform best on your specific stack, codebase size, or coding patterns. Use these numbers to narrow your options, then test.

Dan Rummel is the founder of Fibonacci Labs, where he coaches engineering leaders on AI strategy and team performance. He spent Feb 5th pulling benchmark data instead of actually using the models. Priorities.

Need help figuring out which AI models fit your engineering team's workflow?

Let's Talk →