February 5, 2026. Anthropic ships Claude Opus 4.6. OpenAI ships GPT-5.3 Codex. Same day. Two major frontier model releases, both aimed squarely at developers and coding workflows.
I spent the morning pulling benchmark data from both announcements, cross-referencing with Gemini 3 Pro (Google's most recent release from November), and trying to answer a simple question: which model should I actually be using?
The answer, as usual, is "it depends." But the specifics of how it depends are genuinely interesting this time around.
The Three Contenders
| | Anthropic Claude Opus 4.6 | OpenAI GPT-5.3 Codex | Google DeepMind Gemini 3 Pro |
|---|---|---|---|
| Released | Feb 5, 2026 | Feb 5, 2026 | Nov 18, 2025 |
| Pricing | $5 / $25 per M tokens | TBD | $2 / $12 per M tokens |
| Context Window | 1M (beta) | 400K tokens | 1M tokens |
| Max Output | 128K tokens | 128K tokens | 64K tokens |
| Speed | Adaptive (4 effort levels) | 25% faster; <50% tokens | Deep Think mode |
| Key Feature | Agent Teams | Self-developing | Full multimodal |
Three very different philosophies. Anthropic is pushing on reasoning depth, enterprise reliability, and agentic capabilities. OpenAI is optimizing specifically for coding speed and token efficiency. Google is betting on multimodal breadth and raw math performance.
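To make those list prices concrete, here is a minimal Python sketch of per-request cost at the announced rates. GPT-5.3 Codex is omitted because its pricing is still TBD, and the token counts in the example are illustrative, not taken from any vendor documentation:

```python
# Announced list prices in dollars per million tokens (input, output).
PRICES = {
    "claude-opus-4.6": (5.00, 25.00),
    "gemini-3-pro": (2.00, 12.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Example: a 50K-token prompt producing a 2K-token completion.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 50_000, 2_000):.3f}")
```

At those rates the same request costs roughly 2.4x more on Opus 4.6 than on Gemini 3 Pro, which matters if you are routing high-volume, low-stakes traffic.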
The benchmarks reflect those priorities clearly.
Coding and Agentic Benchmarks
This is what most of you care about. I'll cut straight to the numbers.
GPT-5.3 Codex is the clear leader for terminal-based agentic coding: 77.3% on Terminal-Bench 2.0, a significant gap over Opus 4.6's 65.4%. If your workflow is CLI-heavy agent tasks, this is the number that matters most.
But Terminal-Bench isn't the only coding benchmark worth tracking.
Opus holds the SWE-bench crown. Interestingly, Opus 4.6 essentially matches 4.5 here (80.8% vs 80.9%) rather than surpassing it. The improvements in 4.6 show up elsewhere.
Opus 4.6 essentially matches human performance on OSWorld at 72.7%. That's a big deal for anyone thinking about AI-driven workflow automation. GPT-5.3 Codex made an enormous jump here too, going from GPT-5.2's 37.9% to 64.7%.
Reasoning and Knowledge
Coding isn't everything. If you're using these models for analysis, problem-solving, or research, the reasoning benchmarks tell a different story.
The biggest number here is Opus 4.6's ARC-AGI-2 score: 68.8%, up from 37.6% on Opus 4.5. That's an 83% relative improvement in novel visual reasoning in a single generation. Humanity's Last Exam (53.1% with tools) also puts Opus 4.6 at the top of multidisciplinary reasoning.
But Gemini 3 Pro dominates math. 95% on AIME 2025 without tools, and 23.4% on MathArena Apex where everyone else scores around 1%. If your work involves heavy mathematical computation, that gap is hard to ignore.
Enterprise and Knowledge Work
This is where Opus 4.6 pulls ahead most convincingly. GDPval-AA Elo of 1,606, which is 144 points above GPT-5.2. Finance Agent at 60.7%. BigLaw Bench at 90.2%. BrowseComp at 84.0%.
If your team is doing knowledge-intensive work (legal analysis, financial modeling, research), Anthropic has a clear edge right now. GPT-5.3 Codex does report 70.9% wins/ties on GDPval (matching GPT-5.2), but doesn't report Elo scores or most other enterprise benchmarks — which makes sense since it's specifically a coding-focused model.
The Complete Picture
Here's the full benchmark table across all categories. A lot of cells say "?" because each vendor reports different benchmarks. That's part of the story, too: it's getting harder to compare models directly when everyone is choosing different yardsticks.
| Benchmark | Opus 4.6 | Opus 4.5 | GPT-5.3 Codex | GPT-5.2 | Gemini 3 Pro | Gemini 2.5 Pro |
|---|---|---|---|---|---|---|
| **Coding & Software Engineering** | | | | | | |
| Terminal-Bench 2.0 (agentic terminal coding) | 65.4% | 59.8% | 77.3% | 62.2% | 54.2% | 32.6% |
| SWE-bench Verified (real-world GitHub issues) | 80.8% | 80.9% | ? | 80.0% | 76.2% | 59.6% |
| SWE-Bench Pro (41 repos, 100+ languages) | ? | ? | 56.8% | 55.6% | 43.3% | ? |
| OSWorld (agentic computer use) | 72.7% | 66.3% | 64.7% | 37.9% | ? | ? |
| Aider Polyglot (multi-language coding) | ? | 89.4% | ? | 88.0% | ? | ? |
| OpenRCA (root cause analysis) | 34.9% | 26.9% | ? | ? | ? | ? |
| **Reasoning & Knowledge** | | | | | | |
| Humanity's Last Exam (multidisciplinary reasoning) | 53.1%* | 43.2%* | ? | ? | 45.8%* | 21.6% |
| GPQA Diamond (PhD-level science Q&A) | 91.3% | 87.0% | ? | 93.2%¹ | 91.9% | 86.4% |
| ARC-AGI-2 (novel visual reasoning) | 68.8% | 37.6% | ? | 54.2%¹ | 45.1%² | 4.9% |
| AIME 2025, no tools (math competition) | ? | 87.0% | ? | 94.6% | 95.0% | 88.0% |
| MathArena Apex (hardest math problems) | ? | 1.6% | ? | 1.0% | 23.4% | 0.5% |
| SimpleQA Verified (factual accuracy) | ? | 29.3% | ? | 34.9% | 72.1% | 54.5% |
| **Enterprise & Knowledge Work** | | | | | | |
| GDPval-AA Elo (economic knowledge work) | 1,606 | 1,416 | 70.9%³ | 1,462 | 1,195 | ? |
| Finance Agent (financial analyst tasks) | 60.7% | 55.9% | ? | 56.6% | 44.1% | ? |
| BigLaw Bench (legal reasoning) | 90.2% | ? | ? | ? | ? | ? |
| **Agentic Tasks & Long Context** | | | | | | |
| BrowseComp (finding hard-to-locate info) | 84.0% | ? | ? | ? | ? | ? |
| Vending-Bench 2 (long-horizon agentic tasks) | $8,018 | $4,967 | ? | $1,473 | $5,478 | $574 |
| MRCR v2, 1M, 8-needle (long-context retrieval) | 76.0% | ? | ? | ? | ? | ? |
| **Multimodal & Vision** | | | | | | |
| MMMU-Pro (multimodal reasoning) | 77.3%† | ? | ? | 86.5% | 81.0% | 68.0% |
| Video-MMMU (video understanding) | ? | ? | ? | 90.5% | 87.6% | 83.6% |
Sources: Anthropic — Opus 4.6 announcement · OpenAI — GPT-5.3 Codex announcement · OpenAI — GPT-5.3 Codex system card · Google — Gemini 3 Pro announcement · Google DeepMind — Gemini 3 Pro model page · OpenAI — GPT-5.2 announcement · Scale AI — SWE-Bench Pro leaderboard · Terminal-Bench 2.0 leaderboard · ARC Prize — ARC-AGI-2 leaderboard · OSWorld benchmark
What Actually Matters Here
GPT-5.3 Codex dominates terminal coding
At 77.3% on Terminal-Bench 2.0, OpenAI's Codex model is 12 points ahead of Opus 4.6. If your workflow is CLI-heavy agent tasks, this is the model to watch. It's also 25% faster and uses less than half the tokens of its predecessor for equivalent tasks.
Opus 4.6 leads SWE-bench and computer use
Claude holds the crown on SWE-bench Verified (80.8%) and OSWorld (72.7%, matching the human baseline). For general-purpose code review, debugging, and operating computers autonomously, Opus 4.6 is the strongest option.
Opus 4.6 makes a huge reasoning leap
ARC-AGI-2 jumped from 37.6% to 68.8%, an 83% relative improvement in one generation. Its 53.1% on Humanity's Last Exam (with tools) leads all reported models. This is the biggest reasoning improvement of this release cycle.
Gemini 3 Pro still owns math and multimodal
95% on AIME without tools. 23.4% on MathArena Apex versus ~1% for everyone else. State-of-the-art multimodal scores. For math-heavy or vision-heavy tasks, Gemini 3 Pro remains the clear leader.
Enterprise work: Opus 4.6 by a wide margin
GDPval-AA Elo of 1,606 is 144 points ahead of GPT-5.2. Finance Agent, BigLaw Bench, BrowseComp are all Opus wins. If your work is knowledge-intensive enterprise tasks, Anthropic has a clear edge.
No single winner. Pick for your use case.
Terminal coding: GPT-5.3 Codex. GitHub bugs: Opus 4.6. Math: Gemini 3 Pro. Computer use: Opus 4.6. Vision: Gemini 3 Pro. Enterprise docs: Opus 4.6. Speed: GPT-5.3 Codex. Each has real strengths.
How Should You Think About This?
If you're an engineering leader evaluating these models for your team, here's my take: stop looking for a single answer.
The era of "one model to rule them all" is over. Each of these models has genuine, measurable strengths in different domains. The right approach is to match the model to the task. Use GPT-5.3 Codex for your agentic terminal workflows. Use Opus 4.6 for code review, debugging, and knowledge work. Use Gemini 3 Pro when you need math or multimodal capabilities.
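That match-the-model-to-the-task advice can be sketched as a simple routing table. The task categories and model identifiers below are illustrative assumptions based on the benchmark leaders discussed above, not an official API of any vendor:

```python
# Benchmark-preferred model per task category (illustrative, as of Feb 2026).
ROUTING_TABLE = {
    "terminal_agent": "gpt-5.3-codex",    # Terminal-Bench 2.0 leader
    "code_review": "claude-opus-4.6",     # SWE-bench Verified leader
    "computer_use": "claude-opus-4.6",    # OSWorld leader
    "math": "gemini-3-pro",               # AIME / MathArena leader
    "multimodal": "gemini-3-pro",         # MMMU-Pro / Video-MMMU strength
    "enterprise_docs": "claude-opus-4.6", # GDPval-AA / BigLaw leader
}

def pick_model(task_type: str, default: str = "claude-opus-4.6") -> str:
    """Return the benchmark-preferred model for a task category."""
    return ROUTING_TABLE.get(task_type, default)
```

In practice the routing key would come from a lightweight classifier or an explicit tag on each request, and the table itself should be revisited every release cycle, since these rankings change fast.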
If that sounds like more complexity to manage, it is. But it's also where the leverage is. The teams that figure out which model to use when will outperform the teams that pick one and stick with it.
One more thing worth noting: benchmarks are a starting point, not a final answer. I've seen models that score lower on benchmarks outperform on specific codebases because of how they handle context, follow instructions, or match a team's coding patterns. Always validate on your own workloads.
These numbers will change again soon. That's how this works now.