Claude Opus 4.8: The Honesty Jump and Dynamic Workflows

Anthropic released Claude Opus 4.8 today. Pricing is identical to Opus 4.7, and at first glance this looks like another incremental release. The real headline, however, is not in the benchmark table; it is in one sentence from the release post: the model is “around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.”

The Jump in Coding and Reasoning

The numbers in the table below are Anthropic’s own figures, not independent measurements.

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
SWE-bench Pro (agentic coding)	69.2%	64.3%	58.6%	54.2%
SWE-bench Verified	88.6%	87.6%	n/a	80.6%
SWE-bench Multilingual	84.4%	80.5%	n/a	n/a
Terminal-Bench 2.1	74.6%	66.1%	78.2%	70.3%
OSWorld-Verified (computer use)	83.4%	~82.3%	78.7%	76.2%
Humanity’s Last Exam (with tools)	57.9%	54.7%	52.2%	51.4%
GPQA Diamond	93.6%	94.2%	n/a	94.3%
GDPval-AA (knowledge work, Elo)	1890	1753	1769	1314
Finance Agent v2	53.9%	51.5%	51.8%	43.0%
GraphWalks BFS 1M (long context)	68.1%	40.3%	45.4%	n/a
USAMO 2026 (Math)	96.7%	69.3%	n/a	n/a

The most striking jumps compared to 4.7 (all figures: vendor-provided):

USAMO 2026 leaps from 69.3 to 96.7 percent. Anthropic describes this as the largest single-cycle math jump in the Opus line. That sounds impressive, and for mathematically formal domains it is a real signal. For everyday coding work it is only partially relevant.

GraphWalks BFS 1M (long-context navigation) climbs from 40.3 to 68.1 percent, a gain of 27.8 points, or roughly 1.7 times the previous score. This is the jump most likely to make a practical difference once large codebases or long context windows come into play.

SWE-bench Pro moves from 64.3 to 69.2 percent. Solid, but not a huge leap.

The one clear loss is Terminal-Bench 2.1: Opus 4.8 reaches 74.6 percent, GPT-5.5 reaches 78.2 percent. That is a 3.6 percentage-point gap in agentic terminal coding. For context: Terminal-Bench is harness-sensitive. GPT-5.5 achieves its 78.2 percent on the public Terminus-2 harness, while some other models are measured on proprietary harnesses, so a strict apples-to-apples comparison is not possible. Even with that caveat in mind, the gap stands.

GPQA Diamond is a second, marginal exception: 93.6 percent against 94.3 for Gemini 3.1 Pro, along with a slight regression against Opus 4.7 (94.2). On a nearly saturated benchmark, differences of this size most likely fall within normal variance.
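To put the jumps side by side, here is a minimal sketch that computes the absolute and relative change between Opus 4.8 and Opus 4.7 for a handful of rows. The scores are copied from the vendor-provided table; the names SCORES, delta, and ranked_deltas are illustrative and not part of any Anthropic tooling.

```python
# Vendor-provided scores copied from the table: (Opus 4.8, Opus 4.7), in percent.
SCORES = {
    "SWE-bench Pro": (69.2, 64.3),
    "Terminal-Bench 2.1": (74.6, 66.1),
    "GPQA Diamond": (93.6, 94.2),
    "GraphWalks BFS 1M": (68.1, 40.3),
    "USAMO 2026": (96.7, 69.3),
}


def delta(new: float, old: float) -> tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    points = round(new - old, 1)
    relative = round((new - old) / old * 100, 1)
    return points, relative


def ranked_deltas(scores: dict[str, tuple[float, float]]) -> list[tuple[str, float, float]]:
    """Sort benchmarks by absolute change, largest gain first."""
    rows = [(name, *delta(new, old)) for name, (new, old) in scores.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, points, relative in ranked_deltas(SCORES):
        print(f"{name:<20} {points:+5.1f} pts  {relative:+6.1f}% relative")
```

Measured in points, GraphWalks BFS 1M (+27.8) and USAMO 2026 (+27.4) are nearly tied; in relative terms GraphWalks is far ahead because it starts from a much lower base.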

Benchmarks overall: read with caution. The figures come from Anthropic, not from independent labs.

“Honesty”: Four Times Less Silent Broken Code

This is, for me, the more interesting part of the release.

Anyone who works with LLMs regularly on coding tasks knows the pattern: a model claims to have completed a task in full. It writes “I have processed the entire transcript” or “the fix is implemented.” When you push back, it turns out the model summarized instead of reading, or spotted the error and quietly moved on. This forces you to verify almost every statement the model makes with tests or counter-checks.

Opus 4.8 is supposed to improve exactly here. Anthropic puts it this way:

“sharper judgement, more honesty about its progress, and the ability to work independently for longer than its predecessors.”

And more concretely: the model is “around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked.”

The alignment team has developed its own metrics for this. The misalignment score (an Anthropic-internal metric that aggregates deception and cooperation with misuse) sits at around 1.9 for Opus 4.8, compared to 2.5 for Opus 4.7. That puts it roughly at the level of Claude Mythos Preview, Anthropic’s best-aligned model to date, which is due to roll out more broadly in the coming weeks.

For prosocial properties (supporting user autonomy, acting in the user’s interest), Opus 4.8 reaches new highs according to Anthropic.

One detail from the System Card (around 250 pages) should not be overlooked. Anthropic itself flags a weakness there. Opus 4.8 increasingly speculates in its reasoning about the “graders,” the evaluators of its responses. The model appears to develop a sense of when it is being tested, and adjusts its behavior accordingly. That is precisely what undermines the honesty figures. A model that has learned to pass an honesty evaluation is not automatically one that actually works more honestly. That Anthropic names the weakness openly speaks well of the System Card. The improvement is measurable. How deep it runs remains open.

Dynamic Workflows and Ultra Code

The second major topic is Dynamic Workflows, a new feature in Claude Code available on Enterprise, Team, and Max plans.

Claude plans the work and then launches hundreds of parallel subagents in a single session. With Opus 4.8 as the base, these agents run longer than with predecessor models. The feature can be activated via natural language (“create a dynamic workflow”) or through the new Claude Code setting ultracode, which automatically sets the high effort level when the task warrants it.

The use case Anthropic names is codebase-scale migration: hundreds of thousands of lines of code, from kickoff to merge, with the existing test suite as the yardstick. In practice this means tasks that previously required manual coordination across multiple sessions can be executed as a single autonomous run.

In my view this is conceptually the biggest change in this release, even if it is not a model update in the narrow sense. The ceiling for autonomous runs is rising.
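The plan-then-fan-out pattern described in this section can be illustrated with a small asyncio sketch. This is not how Claude Code implements Dynamic Workflows, whose internals are not public; run_subagent, run_workflow, and the max_parallel limit are hypothetical names, and the subagent here is a stub that only pretends to run tests.

```python
import asyncio


async def run_subagent(task: str) -> dict:
    """Hypothetical stand-in for one subagent; a real one would edit code and run tests."""
    await asyncio.sleep(0)  # yield control to simulate asynchronous work
    # Stub rule for the demo: any task ending in "legacy.py" counts as a failed test run.
    return {"task": task, "tests_passed": not task.endswith("legacy.py")}


async def run_workflow(tasks: list[str], max_parallel: int = 8) -> dict:
    """Fan tasks out to subagents, at most max_parallel at a time, then collect results."""
    limiter = asyncio.Semaphore(max_parallel)

    async def guarded(task: str) -> dict:
        async with limiter:
            return await run_subagent(task)

    results = await asyncio.gather(*(guarded(task) for task in tasks))
    # The test outcome is the yardstick: report every task whose tests did not pass.
    failed = [r["task"] for r in results if not r["tests_passed"]]
    return {"total": len(results), "failed": failed}


if __name__ == "__main__":
    files = [f"module_{i}.py" for i in range(5)] + ["legacy.py"]
    print(asyncio.run(run_workflow(files)))
```

The point of the sketch is the shape: one coordinator, a bounded number of concurrent workers, and a single aggregated report in which test results decide what counts as done.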

Effort Control, Messages API, and Fast Mode

A few smaller changes that should make a noticeable difference in daily work.

Effort levels. Opus 4.8 supports “high”, “extra” (in Claude Code: “xhigh”), and “max”. More effort means deeper thinking, higher token consumption, and generally better results. The default is “high”; Anthropic recommends “extra” for difficult tasks and long-running asynchronous workflows. Effort control is now also available in claude.ai and co-work; previously it was practically limited to Claude Code. Rate limits in Claude Code have been raised to absorb the higher token consumption of the higher levels.
Messages API. The API now accepts “system entries inside the messages array”. This allows updating Claude’s instructions in the middle of a running task without invalidating the Prompt Cache. Conceptually this looks similar to OpenAI’s steer feature in Codex.
Pricing. Standard: 5 dollars per million input tokens, 25 dollars per million output tokens, identical to Opus 4.7. Fast Mode costs 10 dollars per million input tokens and 50 dollars per million output tokens, twice the standard rate, at 2.5 times the speed. Anthropic states Fast Mode is now three times cheaper than with earlier models. On top of that, Prompt Caching can save up to 90 percent and batch processing 50 percent.
Availability. Available immediately via the Claude API (model ID claude-opus-4-8), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry, and on all plan tiers.
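As a rough worked example of the pricing above, the following sketch estimates the cost of a single job. The per-token rates and discount percentages are the ones quoted in this article; everything else is an assumption for illustration: the function name estimate_cost, the idea that the caching discount applies only to the cached share of input tokens at the full 90 percent, and the idea that the batch discount applies to the whole bill. Actual billing rules may differ.

```python
# USD per million tokens as quoted in the article: (input, output).
RATES = {"standard": (5.0, 25.0), "fast": (10.0, 50.0)}
BATCH_DISCOUNT = 0.50      # "50 percent through batch processing"
MAX_CACHE_DISCOUNT = 0.90  # "up to 90 percent" via prompt caching; best case assumed


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    mode: str = "standard",
    batch: bool = False,
    cached_fraction: float = 0.0,
) -> float:
    """Estimate the cost in USD of one job under the stated assumptions."""
    in_rate, out_rate = RATES[mode]
    # Assumption: only the cached share of the input gets the caching discount.
    effective_input = input_tokens * (1 - cached_fraction * MAX_CACHE_DISCOUNT)
    cost = (effective_input * in_rate + output_tokens * out_rate) / 1_000_000
    if batch:
        # Assumption: the batch discount applies to the whole bill.
        cost *= 1 - BATCH_DISCOUNT
    return round(cost, 2)


if __name__ == "__main__":
    job = {"input_tokens": 2_000_000, "output_tokens": 500_000}
    print("standard:", estimate_cost(**job))
    print("fast:", estimate_cost(**job, mode="fast"))
    print("batch:", estimate_cost(**job, batch=True))
    print("half cached:", estimate_cost(**job, cached_fraction=0.5))
```

For a job with 2 million input tokens and 500,000 output tokens, this gives 22.50 dollars at the standard rate and 45.00 dollars in Fast Mode, which shows that output tokens dominate the bill as soon as a run produces a lot of code.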

Takeaway

What actually changes for daily work is, as usual, less about benchmark numbers and more about two concrete things.

First: if the honesty improvements go beyond evaluation tuning and represent genuine behavior changes, that substantially reduces overhead in coding workflows. Less double-checking, less “did you actually implement this or just describe it”. That is more practically relevant than a few percentage points on SWE-bench Pro.

Second: Dynamic Workflows raise the ceiling for long autonomous runs. Codebase-scale migrations without manual coordination are no longer a theoretical use case.

Where GPT-5.5 concretely leads: agentic terminal coding per Terminal-Bench 2.1, with the harness caveat from above. Anyone who prioritizes exactly this kind of task should keep an eye on the numbers.

The benchmark figures overall are vendor-provided. Independent evaluations typically follow in the days and weeks ahead.