GPT-5.5 Reality Check: Benchmarks, Pricing, and What OpenAI Isn't Telling Us
Behind the Release Hype
On April 23, 2026, OpenAI released GPT-5.5, just one week after Anthropic’s Claude Opus 4.7. The cadence is remarkable: five model releases in a bit more than six months (GPT-5.1, 5.2, 5.3-Codex, 5.4, 5.5), and the pressure on OpenAI has been palpable. Since December 2025, the company has reportedly been in a “Code Red” state, while Anthropic’s enterprise ARR has allegedly grown from 9 to 30 billion US dollars. GPT-5.5 is the response to that pressure.
The marketing narrative is familiar: “a new class of intelligence” (Greg Brockman), “our smartest model” (OpenAI blog), “a step toward AGI”. If we take the benchmarks seriously and read the context around them, a more nuanced picture emerges.
A Sober Look at the Technical Facts
GPT-5.5 is, according to OpenAI’s own statements, the first fully retrained base model since GPT-4.5. This isn’t a point update; it is a structural reorientation toward what OpenAI calls “agentic performance”: longer task chains, less prompt micromanagement, autonomous computer use.
Technically, several levers are pulled at once. The model runs on NVIDIA GB200 and GB300-NVL72 systems, uses load-balancing heuristics across GPU cores that were themselves written by AI, and per VentureBeat achieves over 20 % higher token-generation speed than its predecessor, at the same per-token latency as GPT-5.4 despite the higher intelligence.
Context size is 1 million tokens via the API, matching Claude Opus 4.7, whose line has offered the same 1M window since version 4.6 (February 2026). Anthropic has even dropped the price surcharge for contexts beyond 200K since March 2026, a detail that becomes relevant when comparing with OpenAI’s pricing model, where the input price doubles above 272K. GPT-5.5 ships in two modes, Standard and Pro, with Pro using parallel test-time compute: correspondingly slower, but reasoning more deeply.
So much for the official description. How solid are the numbers?
Benchmarks Where GPT-5.5 Leads
In the Artificial Analysis Intelligence Index, GPT-5.5 reaches 60 points, three ahead of Claude Opus 4.7 and Gemini 3.1 Pro Preview (both 57). A measurable but not overwhelming lead. Things get more interesting in the specialist disciplines:
| Benchmark | GPT-5.5 | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7 % | 69.4 % | 68.5 % |
| GDPval (knowledge work) | 84.9 % | 80.3 % | 67.3 % |
| OSWorld-Verified | 78.7 % | 78.0 % | — |
| Expert-SWE (internal) | 73.1 % | — | — |
| SWE-Bench Pro | 58.6 % | 64.3 % | — |
| Humanity’s Last Exam (no tools) | 41.4 % | 46.9 % | — |
| BrowseComp | 84.4 % (Pro: 90.1 %) | 79.3 % | 85.9 % |
| FrontierMath Tier 1–3 | 51.7 % | 43.8 % | — |
| CyberGym | 81.8 % | — | — |
VentureBeat’s summary: GPT-5.5 leads in 14 benchmarks, Opus 4.7 in 4, Gemini 3.1 Pro in 2. The dominance is concentrated in agentic computer use, knowledge work (GDPval), cybersecurity, and advanced mathematics.
Where Claude Opus 4.7 Still Wins
The numbers tell only half the story. On SWE-Bench Pro, one of the most meaningful benchmarks for real software engineering tasks, Claude Opus 4.7 leads GPT-5.5 clearly with 64.3 % vs. 58.6 %. OpenAI’s own commentary points to possible memorization effects, but that doesn’t change the leaderboard position. Anyone refactoring legacy codebases or closing real GitHub issues will still find Opus 4.7 the stronger tool.
Even more striking: on Humanity’s Last Exam without tools, Opus 4.7 scores 46.9 %, GPT-5.5 only 41.4 %. That is pure, zero-shot academic reasoning, no tool-use crutches. When the task is one where the model itself has to think, with no search engine or code execution in the loop, Anthropic still has the edge.
VentureBeat’s analyst Carl Franzen puts it well: GPT-5.5 dominates in “agentic computer use, economic knowledge work, specialized cybersecurity, and complex mathematics”, while Opus 4.7 leads in software engineering and pure reasoning without tools.
Pricing That Can Hurt
Here begins the part OpenAI would rather not dwell on. The API prices:
- GPT-5.5 Standard: 5 USD per 1M input tokens, 30 USD per 1M output tokens
- GPT-5.5 Pro: 180 USD per 1M output tokens
- GPT-5.4 for comparison: 2.50 USD / 15 USD
- Claude Opus 4.7: 5 USD / 25 USD (unchanged from 4.6)
OpenAI has therefore doubled pricing compared to GPT-5.4, and GPT-5.5 Pro costs six times as much per output token as the Standard model. The official counter-argument: GPT-5.5 uses roughly 40 % fewer output tokens for comparable tasks, so the net cost increase should “only” be about 20 %.
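A quick back-of-the-envelope check of that claim, using the list prices above and OpenAI’s claimed 40 % token reduction as the 0.6 factor (the arithmetic is ours, not OpenAI’s):

```python
# Sanity check of the "~20 %" claim. Output side: price doubles, token
# count drops to ~60 %, so 2.0 * 0.6 = 1.2, a 20 % increase. The input
# side has no such offset: 5.00 / 2.50 is a flat doubling.

gpt54_out, gpt55_out = 15.0, 30.0    # USD per 1M output tokens
relative_output_cost = (gpt55_out * 0.6) / gpt54_out
print(f"output cost vs. GPT-5.4: {relative_output_cost:.0%}")  # 120%

gpt54_in, gpt55_in = 2.50, 5.00      # USD per 1M input tokens
print(f"input cost vs. GPT-5.4: {gpt55_in / gpt54_in:.0%}")    # 200%
```

The 20 % figure therefore only holds where output tokens dominate the bill; input-heavy workloads (long documents, large codebases) pay the full doubling.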
There is another, often overlooked point: on GPT-5.x, OpenAI doubles the price above 272K input tokens. So anyone using the full 1M context window effectively pays 10 USD per million input tokens. Anthropic has removed this surcharge entirely since March 2026. A 900K-token request at Claude costs the same per-token price as a 9K request. For agent pipelines processing long documents or full codebases, that is a significant structural advantage, and one of the reasons Anthropic is growing so strongly in the enterprise market.
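To make the structural difference concrete, here is a minimal sketch of the two input-pricing schemes, using the figures quoted above. We assume the doubled rate applies to the entire request once it crosses 272K, which is how the “effectively 10 USD per million” reading works out; the exact billing mechanics are OpenAI’s to confirm.

```python
# Hypothetical input-cost comparison, list prices as quoted above.

def gpt55_input_cost(tokens: int) -> float:
    """5 USD/M input, doubled to 10 USD/M once a request exceeds 272K."""
    rate = 10.00 if tokens > 272_000 else 5.00
    return tokens * rate / 1e6

def opus47_input_cost(tokens: int) -> float:
    """Flat 5 USD/M: Anthropic dropped the long-context surcharge."""
    return tokens * 5.00 / 1e6

for n in (9_000, 272_000, 900_000):
    print(f"{n:>9,} tokens  GPT-5.5: ${gpt55_input_cost(n):.3f}  "
          f"Opus 4.7: ${opus47_input_cost(n):.3f}")
# 900K input: $9.00 vs. $4.50, exactly the 2x gap described above.
```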
The efficiency claim is a clean accounting argument, but in practice a risky one. Anyone running agents in production knows that token usage doesn’t scale linearly with model quality. A model that “thinks deeper” can quickly use significantly more tokens, not fewer, especially with poorly specified prompts. The claimed efficiency math holds for OpenAI’s internal benchmarks, not automatically for our production workloads.
One more point: API availability isn’t there yet. GPT-5.5 currently runs in ChatGPT and Codex for paying users; the API rollout is promised to follow “shortly”. OpenAI justifies the delay with additional safety work related to the “High” cybersecurity capability rating.
The System Card’s Fine Print
The GPT-5.5 System Card is more revealing than the marketing page. For the first time, OpenAI classifies GPT-5.5’s cybersecurity capabilities as “High” in the Preparedness Framework, a level that requires additional safeguards. Concretely: the CyberGym score rises from 79.0 % (GPT-5.4) to 81.8 %, and internal Capture-the-Flag challenges from 83.7 % to 88.1 %.
What’s interesting is what OpenAI explicitly excludes: the model does not possess the ability “to develop functional zero-day exploits of all severity levels in hardened real-world critical systems without human intervention”. The wording is carefully chosen: it negates the critical threshold but leaves considerable room below it. Given that Anthropic’s Mythos preview (not publicly available) reaches 83.1 % on the same CyberGym benchmark, both companies are evidently operating at the same edge.
For bio and chemistry risks, GPT-5.5 remains at the “High” capability level that has been standard since GPT-5-thinking. Safeguards are essentially carried over.
Also notable is what the System Card says about bias evaluations: on the harm_overall metric (gender-dependent response differences with male vs. female names), GPT-5.5 is “on par with GPT-5.1 and within the confidence interval of GPT-5.2 and GPT-5.4”. Translated: on bias reduction, there is no progress. OpenAI phrases it politely.
Lessons from Five Releases
A look back is instructive here. GPT-5.2 launched in December 2025 with similar superlatives, and Sam Altman had to openly admit at a town hall that OpenAI had deliberately neglected writing quality and narrative flow in order to prioritize “Intelligence, Reasoning, Coding”. The Reddit and Hacker News communities tore 5.2 apart for exactly those weaknesses: prudish, mechanical prose, hallucinated APIs, forgotten contract clauses in long documents.
Whether GPT-5.5 fixes these problems or makes them worse will only become clear in the coming weeks. Early tester reports should be taken with a grain of salt given the short availability window, even though OpenAI claims 200 early-access partners. Remember: at the GPT-5 launch in August 2025, publicly visible failures (incorrect U.S. maps, wrong presidential lists, letter-counting errors) were documented within hours. The “near-AGI” label didn’t survive 24 hours.
The auto-router architecture, which since GPT-5 switches between fast and “thinking” variants, has also been a recurring target of quality complaints. Users reported that supposed “Thinking” responses actually came from smaller, cheaper models. With GPT-5.5, a new “xhigh” reasoning tier joins the mix. More levers mean more places where transparency can be lost.
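For illustration only, a sketch of why this kind of routing is hard to audit from the outside. Nothing below reflects OpenAI’s actual router; the tier names other than “xhigh”, the difficulty score, and the load heuristic are all invented:

```python
# Hypothetical auto-router. The point is not the thresholds but the shape:
# a cost-aware router can silently trade depth for capacity, and the
# response carries no proof of which path served it.

from dataclasses import dataclass

@dataclass
class Route:
    model: str
    reasoning_tier: str  # e.g. "low" ... "xhigh" (only "xhigh" is confirmed)

def route(estimated_difficulty: float, load: float) -> Route:
    if estimated_difficulty > 0.8 and load < 0.7:
        return Route("gpt-5.5-thinking", "xhigh")
    if estimated_difficulty > 0.5:
        return Route("gpt-5.5-thinking", "medium")
    # Under high load even a hard prompt can land here, which is the
    # behavior users reported: "Thinking" responses from smaller models.
    return Route("gpt-5.5-instant", "low")

# A hard prompt under heavy load is quietly downgraded:
print(route(estimated_difficulty=0.9, load=0.95))
# Route(model='gpt-5.5-thinking', reasoning_tier='medium')
```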
The Strategic Layer
Anyone who reads Brockman’s press conference carefully spots two messages. First: “The model itself is no longer the whole product. You can think of it as a brain, but we’re also building the body, in the form of the applications we ship, the agentic harnesses.” OpenAI is positioning GPT-5.5 as a building block for a “super-app” meant to bundle ChatGPT, Codex, and the Atlas browser agent into a single session. That explains the strong focus on agentic benchmarks, and why the API has to wait: OpenAI wants to position its own products first.
Second: Jakub Pachocki, Chief Scientist, says OpenAI “has significant headroom left” to train noticeably smarter models. The scaling debate, considered closed after GPT-5, is implicitly reopened here. Pachocki is careful enough not to name timelines.
Which Model for Which Workload
We can summarize the situation fairly clearly:
If our main work consists of agent workflows with tool use, computer operation, and multi-step tasks, GPT-5.5 is currently the best publicly available option. The gains on Terminal-Bench, OSWorld-Verified, and BrowseComp are substantial. The Pro variant in particular, at 90.1 % on BrowseComp, is impressive for deep-research pipelines.
If we are instead writing production code, doing legacy refactoring, or working on pure reasoning tasks, Claude Opus 4.7 remains competitive, and in places superior. The SWE-Bench Pro lead and the lower token pricing speak for themselves. Added to that, the /ultrareview command in Claude Code and the extended task budgets on the Claude platform are concrete, production-ready features.
For independent developers, small teams, and SMBs, the price trajectory is a serious factor. Anthropic keeps Opus 4.7 pricing stable at 5/25 USD; OpenAI moves to 5/30 USD, and 180 USD per 1M output tokens for Pro. On production workloads with thousands of daily requests, this quickly adds up to five-figure monthly bills.
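An illustrative projection, with request volume and token counts chosen to represent a modest production agent workload (our assumptions, not measured data):

```python
# Monthly API spend at list prices, for a hypothetical small-team workload.

requests_per_day = 5_000
tokens_in, tokens_out = 20_000, 4_000          # per request, assumed

def monthly_cost(price_in: float, price_out: float) -> float:
    per_request = (tokens_in * price_in + tokens_out * price_out) / 1e6
    return per_request * requests_per_day * 30

print(f"GPT-5.5 Standard: ${monthly_cost(5.00, 30.00):,.0f}/month")  # $33,000
print(f"Claude Opus 4.7:  ${monthly_cost(5.00, 25.00):,.0f}/month")  # $30,000
```

Comfortably five figures either way; the gap widens the more output-heavy the workload becomes, and GPT-5.5 Pro at 180 USD per 1M output tokens plays in a different league entirely.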
A Word on Benchmark Culture
We shouldn’t forget the limits of these comparisons. Terminal-Bench 2.0, OSWorld-Verified, and GDPval are not laws of nature; they are evaluations with their own assumptions, harnesses, and scoring logic. A model can look better in an evaluation because the scaffold is better, the context manager smarter, the retry logic cleaner. OpenAI’s own notes on GPT-5.4 mention that BrowseComp scores reflect not only model changes but also changes in the search system and the state of the web.
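A single harness knob is enough to move a score by double digits. If a model solves a task with probability p per attempt and the scaffold retries failures k times, the reported score is 1 - (1 - p)^k (numbers illustrative, attempts assumed independent):

```python
# Effect of a scaffold's retry budget on a reported benchmark score.

p = 0.55                         # per-attempt success rate, assumed
for k in (1, 2, 4):
    score = 1 - (1 - p) ** k     # P(at least one of k attempts succeeds)
    print(f"retries={k}: reported score {score:.1%}")
# retries=1: 55.0%  retries=2: 79.8%  retries=4: 95.9%
```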
The honest framing is therefore: GPT-5.5 is, by today’s publicly accessible benchmarks, the strongest frontier model, but in a field where the leading pack changes position within weeks. A month ago it was GPT-5.4 Pro, then Opus 4.7, now GPT-5.5. In six weeks, Anthropic or Google will likely respond.
Verdict
GPT-5.5 is a solid release, in agentic scenarios an outstanding one, and it visibly raises the pressure on Anthropic and Google. The marketing rhetoric (“new class of intelligence”) is overdone, but the actual gains, especially on OSWorld, Terminal-Bench, and GDPval, are real and measurable.
At the same time, three structural problems stand out. The price doubling shifts the market against independent developers and smaller teams. The lack of API availability at launch points to safety concerns OpenAI is not communicating transparently enough. And the stagnation on bias and fairness metrics visible in the System Card reminds us that “smarter” does not automatically mean “better”.
Anyone building a production system today should not switch blindly to the newest model. The question isn’t which model tops the longest benchmark table, but which offers the best combination of quality, price, and reliability for a specific workload. For many scenarios that will still be Claude Opus 4.7, and for some, GPT-5.5. The era of “the one best AI” is definitively over, if it ever existed.
Sources and Further Reading
- OpenAI: Introducing GPT-5.5 (April 23, 2026)
- GPT-5.5 System Card (OpenAI Deployment Safety Hub)
- VentureBeat: OpenAI’s GPT-5.5 is here, and it’s no potato
- The New Stack: OpenAI launches GPT-5.5, calling it “a new class of intelligence”
- Fast Company: OpenAI releases GPT-5.5, a more powerful engine for coding, science, and general work
- The Next Web: OpenAI launches GPT-5.5, its first fully retrained base model since GPT-4.5
- Anthropic: Introducing Claude Opus 4.7 (April 16, 2026)
- Digital Applied: GPT-5.5 Complete Guide
- all-ai.de: GPT-5.5 stronger and more expensive than Claude Opus
Translated from the German original with the help of Claude.