Claude Opus 4.7: More Coding Power, More Pixels, and a Hint at Mythos
Anthropic releases Claude Opus 4.7 today, with its strongest coding performance to date, roughly 3.75 megapixels of vision, and a new effort level. And on the side: a hint at something even more powerful.
New model versions now arrive so regularly that you can feel the fatigue. Every model is the “most capable yet,” every benchmark table shows red arrows pointing up. Still, Opus 4.7 is worth a closer look, because some of the improvements here are genuinely substantial.
The Key Points
Opus 4.7 is Anthropic’s new flagship model, replacing Opus 4.6 as the recommended choice for demanding tasks. The price remains unchanged: $5 per million input tokens and $25 per million output tokens. The model is available immediately through Anthropic’s own Claude products and the API, as well as on Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. For those working directly with the API, the identifier is claude-opus-4-7.
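For orientation, a minimal request against the new model might look like this. The sketch uses the official Anthropic Python SDK and its standard messages pattern; only the model identifier is taken from the announcement.

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",  # identifier from the announcement
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize the tradeoffs of optimistic locking."}
    ],
)
print(response.content[0].text)
```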
What has changed in substance can be roughly divided into four areas: coding, vision, instruction-following, and new control mechanisms.
Coding: The Most Notable Advance
Anthropic explicitly positions Opus 4.7 as a model for difficult software development tasks, and the numbers back that up. On an internal 93-task benchmark, the model achieves a 13 percent lift over its predecessor, and it solved four of those tasks that neither Opus 4.6 nor Sonnet 4.6 could complete. On CursorBench, a practical benchmark for autonomous coding-agent behavior, the hit rate jumps from 58% to 70%, and Anthropic describes 4.7 as independently solving three times as many production tasks as its predecessor. On top of that, it achieves state-of-the-art results on the GDPval-AA evaluation and on a finance-agent benchmark.
In practice, this means that complex, multi-step tasks where Opus 4.6 would still abort or take a wrong turn (refactoring across many files, parallel tool calls, debugging with incomplete context) run more reliably with 4.7. The model also checks its own outputs more actively before returning results. Anthropic doesn’t describe what exactly that means under the hood, but the effect is noticeable in practice.
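To make the agentic pattern concrete: with the standard tool-use API, a single response can contain several tool_use blocks, which an agent loop may then execute in parallel instead of sequentially. A minimal sketch, with tool names and schemas invented purely for illustration:

```python
import anthropic

client = anthropic.Anthropic()

# Two illustrative tools; the names and schemas are made up for this sketch.
tools = [
    {
        "name": "read_file",
        "description": "Read a file from the repository.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "run_tests",
        "description": "Run the test suite and return its output.",
        "input_schema": {"type": "object", "properties": {}},
    },
]

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    tools=tools,
    messages=[{"role": "user", "content": "Fix the failing import in utils.py."}],
)

# One response can carry multiple tool_use blocks; an agent loop can
# dispatch them concurrently rather than one at a time.
tool_calls = [block for block in response.content if block.type == "tool_use"]
for call in tool_calls:
    print(call.name, call.input)
```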
In external comparisons, Opus 4.7 currently leads SWE-bench Pro with 64.3% versus 57.7% for GPT-5.4. On the classic SWE-bench Verified, GPT-5.4 still holds a minimal edge (74.9% vs. ~74%), but Opus 4.7’s advantage is likely to show on harder, more realistic tasks, which is precisely where the newer benchmark places its weight.
Vision: From Postage Stamp to Poster
The improvement in image processing is the most surprising number in Anthropic’s announcement. Opus 4.7 accepts images with up to 2,576 pixels on the long edge, corresponding to roughly 3.75 megapixels. For comparison: predecessor versions worked with about one megapixel. A factor of nearly four in usable image resolution.
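Anyone preprocessing images before upload can enforce that limit in a few lines. A sketch using Pillow; the 2,576-pixel constant is simply the announced limit, the rest is generic resizing code:

```python
from PIL import Image

LONG_EDGE_LIMIT = 2576  # long-edge limit from the announcement

def fit_to_long_edge(path: str, out_path: str) -> None:
    """Downscale an image so its longer side stays within LONG_EDGE_LIMIT."""
    img = Image.open(path)
    long_edge = max(img.size)
    if long_edge > LONG_EDGE_LIMIT:
        scale = LONG_EDGE_LIMIT / long_edge
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.Resampling.LANCZOS)
    img.save(out_path)

fit_to_long_edge("diagram.png", "diagram_fitted.png")
```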
On a visual accuracy benchmark, the hit rate jumps from 54.5% on Opus 4.6 to 98.5% on Opus 4.7. That sounds dramatic at first — and it is, at least in the context of computer-use agents that need to read screen content. Specifically, the model can now reliably read chemical structural formulas and technical diagrams. For diagram analysis, screenshot debugging, and UI reference work, this is a direct improvement in everyday usability.
Anthropic also describes the model as stylistically stronger when creating professional content such as interfaces, presentations, and documents. That is hard to measure objectively, but anyone who used Opus 4.6 for design tasks should notice the difference fairly quickly.
Instruction-Following: More Precise, Sometimes Surprising
Opus 4.7 interprets instructions more literally than previous versions. Anthropic explicitly notes that existing prompts tuned for the predecessor may need recalibration. What initially sounds like a regression is actually a quality improvement: the model takes fewer interpretive liberties and follows what was actually written, not what might have been intended.
For developers using prompts in production, this is a double-edged sword: more predictable on the one hand, but in places 4.6 was more forgiving of imprecise task descriptions. It’s worth systematically reviewing your own system prompts when upgrading to 4.7.
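One pragmatic way to do that review is a small regression harness that runs the same prompt suite against both versions and compares the results. A sketch under two assumptions: “claude-opus-4-6” as the predecessor’s identifier, and a print statement standing in for your real assertions.

```python
import anthropic

client = anthropic.Anthropic()

# NOTE: "claude-opus-4-6" is an assumed identifier for the predecessor.
MODELS = ["claude-opus-4-6", "claude-opus-4-7"]

# A few prompts that represent your production workload.
PROMPTS = [
    "Return only valid JSON with the keys 'title' and 'tags'.",
    "Rewrite the following function without changing its behavior: ...",
]

for prompt in PROMPTS:
    for model in MODELS:
        response = client.messages.create(
            model=model,
            max_tokens=512,
            messages=[{"role": "user", "content": prompt}],
        )
        text = response.content[0].text
        # Replace with real checks (JSON parses, format holds, etc.).
        print(f"{model}: {text[:80]!r}")
```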
For document understanding, 4.7 also delivers 21% fewer errors than its predecessor on OfficeQA Pro, a benchmark for extracting information from structured business documents like Excel spreadsheets or PowerPoint slides.
New Control Mechanisms
Three new features relevant for regular API users:
xhigh Effort Level. Anthropic introduces a new level xhigh between the existing high and max tiers. This gives finer control over the tradeoff between reasoning depth and latency on difficult tasks, without going straight into the full token budget of max. In Claude Code, xhigh is now the default for all plans.
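The announcement doesn’t show how xhigh is selected on the wire, so the following is an assumption-laden sketch: it passes an effort field through the SDK’s generic extra_body escape hatch, and the actual parameter name and placement may differ.

```python
import anthropic

client = anthropic.Anthropic()

# "effort" as a request field is a hypothetical name for illustration;
# check the API reference for the real parameter and its allowed values.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    extra_body={"effort": "xhigh"},  # hypothetical: between "high" and "max"
    messages=[
        {"role": "user", "content": "Find the race condition in this scheduler: ..."}
    ],
)
print(response.content[0].text)
```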
Task Budgets (Public Beta). Opus 4.7 allows budgeting token expenditure for individual tasks. In agentic workflows that run through multiple steps autonomously, token control was previously coarse: either the entire context window or a manual abort. A task budget gives the model a guideline for how much effort is appropriate for the job at hand.
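The same caveat applies here, since the beta’s wire format isn’t public: in the sketch below, both the beta flag and the task_budget field are hypothetical placeholders that the public beta documentation would replace.

```python
import anthropic

client = anthropic.Anthropic()

# Both the beta flag and the "task_budget" field are hypothetical names.
response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=8192,
    extra_headers={"anthropic-beta": "task-budgets-2026-04"},  # hypothetical flag
    extra_body={"task_budget": {"max_tokens": 20000}},         # hypothetical field
    messages=[{"role": "user", "content": "Migrate the config loader to TOML."}],
)
```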
/ultrareview Command. New in Claude Code: a dedicated command for code-review sessions that puts the model into a mode where it actively hunts for bugs and design issues, proactively analyzing rather than just answering. Pro and Max users receive three free ultrareviews.
Tokenizer and Token Consumption
A detail that’s easy to miss: Opus 4.7 uses an updated tokenizer. Depending on content, the same input now produces between 1.0 and 1.35 times as many tokens as under the predecessor, i.e. up to 35% more. Anthropic states that the net effect on internal coding evaluations is positive, but recommends measuring actual token consumption on your own traffic before adjusting budgets. In general, the model produces more output tokens at higher effort levels, especially in later turns of agent-heavy sessions; this improves reliability on difficult problems but costs correspondingly more.
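Measuring is straightforward with the SDK’s token-counting endpoint, which tokenizes without generating. In this sketch, only the predecessor identifier “claude-opus-4-6” is an assumption:

```python
import anthropic

client = anthropic.Anthropic()

# A representative prompt from your own traffic.
messages = [{"role": "user", "content": open("representative_prompt.txt").read()}]

# NOTE: "claude-opus-4-6" is an assumed identifier for the predecessor.
for model in ["claude-opus-4-6", "claude-opus-4-7"]:
    count = client.messages.count_tokens(model=model, messages=messages)
    print(model, count.input_tokens)
```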
Safety and Cyber-Safeguards
In its safety profile, Opus 4.7 stays close to its predecessor: low rates of deception, sycophancy, and cooperation with misuse. On honesty and resistance to prompt-injection attacks, 4.7 actually performs better than 4.6. Anthropic itself names one small weakness: the model provides slightly more detailed harm-reduction guidance on controlled substances than intended.
In the cybersecurity domain, capabilities are deliberately capped below the level of Mythos Preview. High-risk cybersecurity requests are automatically detected and blocked. For legitimate security researchers (penetration testers, red teamers, and vulnerability researchers) there is a new Cyber Verification Program that unlocks targeted access.
Performance Comparison
| Benchmark | Claude Opus 4.7 | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-bench Pro (autonomous coding) | 64.3% | ~55% | 57.7% | n/a |
| SWE-bench Verified | ~74% | 74% | 74.9% | 63.8% |
| CursorBench (agentic coding) | 70% | 58% | n/a | n/a |
| GPQA Diamond (reasoning) | 94.2% | ~92% | 94.4% | 94.3% |
| Visual accuracy (computer use) | 98.5% | 54.5% | n/a | n/a |
| OfficeQA Pro (document analysis) | -21% errors vs. 4.6 | Baseline | n/a | n/a |
| Long-context research (0-1) | 0.715 | n/a | n/a | n/a |
| Price input / output ($/M tokens) | $5 / $25 | $5 / $25 | ~$10 / $30 | $2 / $12 |
Sources: Anthropic, The Next Web, LM Council (April 2026). n/a = no official value available. Benchmarks are not fully comparable; test sets and versions differ.
The Unspoken Topic: Mythos
In its announcement, Anthropic did something rather unusual: the company confirmed that a model called Mythos exists and that it surpasses Opus 4.7 in capability. At the same time, it said that Mythos is not yet being released, for safety reasons.
The story behind Mythos, however, is more complex than a single line in a release blog post suggests. A sandbox escape in a controlled test, thousands of claimed zero-days based on 198 manually reviewed cases, an exclusive partner program called Project Glasswing, and the question of where the line runs between genuine security risk and strategic staging. For those who want to read the details: the analysis on Claude Mythos assesses it based on 18 sources.
When Mythos will arrive remains open. That it will arrive is certain.
Translated with the help of Claude