February 5, 2026. Anthropic ships Claude Opus 4.6. OpenAI ships GPT-5.3 Codex. Same day. Two major frontier model releases, both aimed squarely at developers and coding workflows.
I spent the morning pulling benchmark data from both announcements, cross-referencing with Gemini 3 Pro (Google's most recent release from November), and trying to answer a simple question: which model should I actually be using?
The answer, as usual, is "it depends." But the specifics of how it depends are genuinely interesting this time around.
The Three Contenders
| | Anthropic Claude Opus 4.6 | OpenAI GPT-5.3 Codex | Google DeepMind Gemini 3 Pro |
|---|---|---|---|
| Released | Feb 5, 2026 | Feb 5, 2026 | Nov 18, 2025 |
| Pricing | $5 / $25 per M tokens | TBD | $2 / $12 per M tokens |
| Context Window | 1M (beta) | 400K tokens | 1M tokens |
| Max Output | 128K tokens | 128K tokens | 64K tokens |
| Speed | Adaptive (4 effort levels) | 25% faster; <50% tokens | Deep Think mode |
| Key Feature | Agent Teams | Self-developing | Full multimodal |
Three very different philosophies. Anthropic is pushing on reasoning depth, enterprise reliability, and agentic capabilities. OpenAI is optimizing specifically for coding speed and token efficiency. Google is betting on multimodal breadth and raw math performance.
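To make those list prices concrete, here is a minimal Python sketch of per-request cost at the announced rates. GPT-5.3 Codex is omitted because its pricing is still TBD, and the token counts in the example are illustrative, not taken from any vendor documentation:

```python
# Announced list prices in dollars per million tokens (input, output).
PRICES = {
    "claude-opus-4.6": (5.00, 25.00),
    "gemini-3-pro": (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Example: a 50K-token prompt producing a 2K-token completion.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 50_000, 2_000):.3f}")
```

At those rates the same request costs roughly 2.4x more on Opus 4.6 than on Gemini 3 Pro, which matters if you are routing high-volume, low-stakes traffic.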
The benchmarks reflect those priorities clearly.
Coding and Agentic Benchmarks
This is what most of you care about. I'll cut straight to the numbers.
GPT-5.3 Codex is the clear leader for terminal-based agentic coding: 77.3% on Terminal-Bench 2.0, a significant gap over Opus 4.6's 65.4%. If your workflow is CLI-heavy agent tasks, this is the number that matters most.
But Terminal-Bench isn't the only coding benchmark worth tracking.
Opus holds the SWE-bench crown. Interestingly, Opus 4.6 essentially matches 4.5 here (80.8% vs 80.9%) rather than surpassing it. The improvements in 4.6 show up elsewhere.
Opus 4.6 essentially matches human performance on OSWorld at 72.7%. That's a big deal for anyone thinking about AI-driven workflow automation. GPT-5.3 Codex made an enormous jump here too, going from GPT-5.2's 37.9% to 64.7%.
Reasoning and Knowledge
Coding isn't everything. If you're using these models for analysis, problem-solving, or research, the reasoning benchmarks tell a different story.
The biggest number here is Opus 4.6's ARC-AGI-2 score: 68.8%, up from 37.6% on Opus 4.5. That's an 83% relative improvement in novel visual reasoning in a single generation. Humanity's Last Exam (53.1% with tools) also puts Opus 4.6 at the top of multidisciplinary reasoning.
But Gemini 3 Pro dominates math. 95% on AIME 2025 without tools, and 23.4% on MathArena Apex where everyone else scores around 1%. If your work involves heavy mathematical computation, that gap is hard to ignore.
Enterprise and Knowledge Work
This is where Opus 4.6 pulls ahead most convincingly. GDPval-AA Elo of 1,606, which is 144 points above GPT-5.2. Finance Agent at 60.7%. BigLaw Bench at 90.2%. BrowseComp at 84.0%.
If your team is doing knowledge-intensive work (legal analysis, financial modeling, research), Anthropic has a clear edge right now. GPT-5.3 Codex does report 70.9% wins/ties on GDPval (matching GPT-5.2), but doesn't report Elo scores or most other enterprise benchmarks — which makes sense since it's specifically a coding-focused model.
The Complete Picture
Here's the full benchmark table across all categories. A lot of cells say "?" because each vendor reports different benchmarks. That's part of the story, too: it's getting harder to compare models directly when everyone is choosing different yardsticks.
| Benchmark | Opus 4.6 | Opus 4.5 | GPT-5.3 Codex | GPT-5.2 | Gemini 3 Pro | Gemini 2.5 Pro |
|---|---|---|---|---|---|---|
| **Coding & Software Engineering** | | | | | | |
| Terminal-Bench 2.0 (agentic terminal coding) | 65.4% | 59.8% | 77.3% | 62.2% | 54.2% | 32.6% |
| SWE-bench Verified (real-world GitHub issues) | 80.8% | 80.9% | ? | 80.0% | 76.2% | 59.6% |
| SWE-Bench Pro (41 repos, 100+ languages) | ? | ? | 56.8% | 55.6% | 43.3% | ? |
| OSWorld (agentic computer use) | 72.7% | 66.3% | 64.7% | 37.9% | ? | ? |
| Aider Polyglot (multi-language coding) | ? | 89.4% | ? | 88.0% | ? | ? |
| OpenRCA (root cause analysis) | 34.9% | 26.9% | ? | ? | ? | ? |
| **Reasoning & Knowledge** | | | | | | |
| Humanity's Last Exam (multidisciplinary reasoning) | 53.1%* | 43.2%* | ? | ? | 45.8%* | 21.6% |
| GPQA Diamond (PhD-level science Q&A) | 91.3% | 87.0% | ? | 93.2%¹ | 91.9% | 86.4% |
| ARC-AGI-2 (novel visual reasoning) | 68.8% | 37.6% | ? | 54.2%¹ | 45.1%² | 4.9% |
| AIME 2025, no tools (math competition) | ? | 87.0% | ? | 94.6% | 95.0% | 88.0% |
| MathArena Apex (hardest math problems) | ? | 1.6% | ? | 1.0% | 23.4% | 0.5% |
| SimpleQA Verified (factual accuracy) | ? | 29.3% | ? | 34.9% | 72.1% | 54.5% |
| **Enterprise & Knowledge Work** | | | | | | |
| GDPval-AA Elo (economic knowledge work) | 1,606 | 1,416 | 70.9%³ | 1,462 | 1,195 | ? |
| Finance Agent (financial analyst tasks) | 60.7% | 55.9% | ? | 56.6% | 44.1% | ? |
| BigLaw Bench (legal reasoning) | 90.2% | ? | ? | ? | ? | ? |
| **Agentic Tasks & Long Context** | | | | | | |
| BrowseComp (finding hard-to-locate info) | 84.0% | ? | ? | ? | ? | ? |
| Vending-Bench 2 (long-horizon agentic tasks) | $8,018 | $4,967 | ? | $1,473 | $5,478 | $574 |
| MRCR v2, 1M, 8-needle (long-context retrieval) | 76.0% | ? | ? | ? | ? | ? |
| **Multimodal & Vision** | | | | | | |
| MMMU-Pro (multimodal reasoning) | 77.3%† | ? | ? | 86.5% | 81.0% | 68.0% |
| Video-MMMU (video understanding) | ? | ? | ? | 90.5% | 87.6% | 83.6% |
Sources: Anthropic — Opus 4.6 announcement · OpenAI — GPT-5.3 Codex announcement · OpenAI — GPT-5.3 Codex system card · Google — Gemini 3 Pro announcement · Google DeepMind — Gemini 3 Pro model page · OpenAI — GPT-5.2 announcement · Scale AI — SWE-Bench Pro leaderboard · Terminal-Bench 2.0 leaderboard · ARC Prize — ARC-AGI-2 leaderboard · OSWorld benchmark
What Actually Matters Here
GPT-5.3 Codex dominates terminal coding
At 77.3% on Terminal-Bench 2.0, OpenAI's Codex model is 12 points ahead of Opus 4.6. If your workflow is CLI-heavy agent tasks, this is the model to watch. It's also 25% faster and uses less than half the tokens of its predecessor for equivalent tasks.
Opus 4.6 leads SWE-bench and computer use
Claude holds the crown on SWE-bench Verified (80.8%) and OSWorld (72.7%, matching the human baseline). For general-purpose code review, debugging, and operating computers autonomously, Opus 4.6 is the strongest option.
Opus 4.6 makes a huge reasoning leap
ARC-AGI-2 jumped from 37.6% to 68.8%, an 83% relative improvement in one generation. Its 53.1% on Humanity's Last Exam (with tools) leads all reported models. This is the biggest reasoning improvement of this release cycle.
Gemini 3 Pro still owns math and multimodal
95% on AIME without tools. 23.4% on MathArena Apex versus ~1% for everyone else. State-of-the-art multimodal scores. For math-heavy or vision-heavy tasks, Gemini 3 Pro remains the clear leader.
Enterprise work: Opus 4.6 by a wide margin
GDPval-AA Elo of 1,606 is 144 points ahead of GPT-5.2. Finance Agent, BigLaw Bench, BrowseComp are all Opus wins. If your work is knowledge-intensive enterprise tasks, Anthropic has a clear edge.
No single winner. Pick for your use case.
Terminal coding: GPT-5.3 Codex. GitHub bugs: Opus 4.6. Math: Gemini 3 Pro. Computer use: Opus 4.6. Vision: Gemini 3 Pro. Enterprise docs: Opus 4.6. Speed: GPT-5.3 Codex. Each has real strengths.
How Should You Think About This?
If you're an engineering leader evaluating these models for your team, here's my take: stop looking for a single answer.
The era of "one model to rule them all" is over. Each of these models has genuine, measurable strengths in different domains. The right approach is to match the model to the task. Use GPT-5.3 Codex for your agentic terminal workflows. Use Opus 4.6 for code review, debugging, and knowledge work. Use Gemini 3 Pro when you need math or multimodal capabilities.
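That match-the-model-to-the-task advice can be sketched as a simple routing table. The task categories and model identifiers below are illustrative assumptions based on the benchmark leaders discussed above, not an official API of any vendor:

```python
# Benchmark-preferred model per task category (illustrative, as of Feb 2026).
ROUTING_TABLE = {
    "terminal_agent": "gpt-5.3-codex",    # Terminal-Bench 2.0 leader
    "code_review": "claude-opus-4.6",     # SWE-bench Verified leader
    "computer_use": "claude-opus-4.6",    # OSWorld leader
    "math": "gemini-3-pro",               # AIME / MathArena leader
    "multimodal": "gemini-3-pro",         # MMMU-Pro / Video-MMMU strength
    "enterprise_docs": "claude-opus-4.6", # GDPval-AA / BigLaw leader
}

def pick_model(task_type: str, default: str = "claude-opus-4.6") -> str:
    """Return the benchmark-preferred model for a task category."""
    return ROUTING_TABLE.get(task_type, default)
```

In practice the routing key would come from a lightweight classifier or an explicit tag on each request, and the table itself should be revisited every release cycle, since these rankings change fast.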
If that sounds like more complexity to manage, it is. But it's also where the leverage is. The teams that figure out which model to use when will outperform the teams that pick one and stick with it.
One more thing worth noting: benchmarks are a starting point, not a final answer. I've seen models that score lower on benchmarks outperform on specific codebases because of how they handle context, follow instructions, or match a team's coding patterns. Always validate on your own workloads.
These numbers will change again soon. That's how this works now.