Summary

  • Anthropic debuted Claude Opus 4.8 on Thursday, just six weeks after releasing Opus 4.7.
  • The update improves on software engineering and reasoning benchmarks while keeping the same pricing: $5 per million input tokens and $25 per million output tokens.
  • Opus 4.8's alignment metrics are now on par with Claude Mythos Preview, reflecting a significant reduction in rates of deceptive or harmful behavior compared to the previous version.

It took Anthropic just six weeks to advance from Opus 4.7 to Opus 4.8.

The latest model is faster and, by benchmark measures, smarter, and it adds new features at the same price: $5 per million input tokens and $25 per million output tokens.

A fast mode is also available. It runs at 2.5 times the speed and costs $10 per million input tokens and $50 per million output tokens. Anthropic says that is three times cheaper than fast mode on earlier models.
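To make the standard and fast-mode rates concrete, here is a minimal sketch that prices a single job at both. The rates come from the figures above; the workload size (2 million input tokens, 500,000 output tokens) and the `job_cost` helper are illustrative assumptions, not anything Anthropic publishes.

```python
# Prices in USD per million tokens, as quoted in the article.
PRICES = {
    "standard": {"input": 5.00, "output": 25.00},
    "fast": {"input": 10.00, "output": 50.00},
}


def job_cost(mode: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one job at the given mode's per-million rates."""
    rates = PRICES[mode]
    total = input_tokens * rates["input"] + output_tokens * rates["output"]
    return total / 1_000_000


# Hypothetical workload: 2M input tokens and 500k output tokens.
standard = job_cost("standard", 2_000_000, 500_000)
fast = job_cost("fast", 2_000_000, 500_000)
print(f"standard: ${standard:.2f}, fast: ${fast:.2f}")
# prints "standard: $22.50, fast: $45.00"
```

Because both fast-mode rates are exactly double the standard ones, the fast-mode bill is double for any mix of input and output tokens.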

The SWE-bench Pro benchmark is crucial for evaluating this model's capabilities. It assesses whether an AI can tackle complex, multi-language software engineering challenges derived from actual production codebases, with results presented as a percentage of problems solved.

In this benchmark, Opus 4.8 achieved a score of 69.2%, an increase from 64.3% for Opus 4.7. In comparison, OpenAI's GPT-5.5 scored 58.6%, while Google’s Gemini 3.1 Pro lagged at 54.2%. This represents a notable improvement for a model within the same price range.

On Humanity's Last Exam, which tests expert-level questions across various academic fields, Opus 4.8 scored 49.8% without tools and 57.9% with tools, outperforming all competitors. In OSWorld-Verified tests, which evaluate practical computer tasks such as navigating software interfaces, it scored 83.4%, slightly above Opus 4.7's score of 82.8%.

However, in Terminal-Bench 2.1, which evaluates command-line task performance, GPT-5.5 led with a score of 78.2%. Opus 4.8 scored 74.6%, an improvement over Opus 4.7's 66.1% and ahead of Gemini's 70.3%, but still only good for second place.

Five Modes of Thinking

Anthropic has introduced user-controlled effort settings for the model. The default setting is "High," which is effective for most tasks. The "Extra" mode, referred to as "xhigh" in Claude Code, dedicates more compute to tougher challenges. The "Max" mode is the highest effort level, while "Low" and "Medium" allocate fewer tokens, trading some accuracy for speed.

The effort control is built into claude.ai and Cowork and is available on all subscription plans. Anthropic notes that the default High setting uses a similar number of tokens to Opus 4.7's default but yields better outcomes, an indication of either impressive engineering or effective communication, likely both.

It's crucial to note that the new tokenizer for Opus requires more tokens per task, meaning users may incur higher costs when opting for Opus over Claude Sonnet, which, while less capable, might suffice for day-to-day tasks and more straightforward challenges.

Rate limits in Claude Code have also been adjusted to accommodate the increased token usage associated with the Extra and Max settings.

Safety Improvements Comparable to Claude Mythos

According to Anthropic’s alignment team, Opus 4.8 "achieves new benchmarks in our evaluations of prosocial characteristics such as promoting user autonomy and acting in the user's best interests." Specifically, rates of deception and cooperation in misuse have significantly decreased compared to Opus 4.7 and are now comparable to Claude Mythos Preview, Anthropic's most secure model.

Notably, Opus 4.8 is four times less likely than its predecessor to overlook bugs in its own code.

It is essential to contextualize the comparison with Mythos. Mythos is a superior tier, described by Anthropic as "larger and more intelligent than our Opus models," and is currently only available in a preview format to a select few organizations focused on cybersecurity through Project Glasswing.

The U.K.'s AI Security Institute found that Mythos could autonomously complete "The Last Ones," a complex 32-step corporate network attack simulation that typically takes human red teams around 20 hours. That capability is why Mythos is not yet available for purchase. Anthropic is working on enhancing cybersecurity measures and anticipates making Mythos-class models accessible "in the coming weeks."

Additionally, today marks the launch of dynamic workflows in Claude Code, currently in research preview. This feature allows Claude to generate its own orchestration scripts and activate parallel subagents within a single session, verify their outputs, and report back—similar to the functionality provided by Hermes.

Dynamic workflows are available for users on Enterprise, Team, and Max plans, and Anthropic has been transparent that they require significantly more tokens than a standard Claude Code session.
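The pattern described for dynamic workflows (fan out to parallel subagents, verify what comes back, then report) can be illustrated with a generic sketch. This is not Claude Code's implementation or API: `run_subagent`, `verify`, and `orchestrate` are hypothetical stand-ins, and the "subagent" here just uppercases a string instead of calling a model.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> dict:
    """Hypothetical stand-in for a subagent; a real one would call a model."""
    return {"task": task, "result": task.upper()}


def verify(output: dict) -> bool:
    """Hypothetical check: accept only non-empty results."""
    return bool(output["result"])


def orchestrate(tasks: list[str]) -> dict:
    """Fan tasks out in parallel, verify each output, and summarize."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Executor.map returns results in the same order as the inputs.
        outputs = list(pool.map(run_subagent, tasks))
    accepted = [o for o in outputs if verify(o)]
    rejected = [o["task"] for o in outputs if not verify(o)]
    return {"accepted": accepted, "rejected": rejected}


# Three toy tasks; the empty one fails verification.
report = orchestrate(["lint", "test", ""])
print(report)
```

Each parallel worker does its own share of work, which is consistent with Anthropic's warning that these sessions consume far more tokens than a standard one.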

Growing Price Discrepancy

Anthropic's pricing of $5/$25 stands in stark contrast to recent developments in China.

DeepSeek V4 Pro recently established a permanent 75% discount, now priced at $0.435 per million input tokens and $0.87 per million output tokens. The Xiaomi MiMo V2.5 Pro offers similar pricing via providers like OpenRouter.

Anthropic's fast mode costs $10 for input and $50 for output per million tokens, double the standard Opus 4.8 pricing and approximately 57 times more per output token than DeepSeek V4 Pro. Companies already spend millions on inference with American models, and heavy Opus use could quickly push a bill into that range.
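The size of the gap can be checked directly from the quoted prices. The sketch below computes output-token price ratios; the dictionary names and the `price_ratio` helper are illustrative, and the only inputs are the per-million-token figures given in this article.

```python
# USD per million tokens, as quoted in the article.
OPUS_STANDARD = {"input": 5.00, "output": 25.00}
OPUS_FAST = {"input": 10.00, "output": 50.00}
DEEPSEEK_V4_PRO = {"input": 0.435, "output": 0.87}


def price_ratio(a: dict, b: dict, kind: str) -> float:
    """How many times more model `a` costs than model `b` for `kind` tokens."""
    return a[kind] / b[kind]


fast_vs_deepseek = price_ratio(OPUS_FAST, DEEPSEEK_V4_PRO, "output")
standard_vs_deepseek = price_ratio(OPUS_STANDARD, DEEPSEEK_V4_PRO, "output")
print(f"fast mode: {fast_vs_deepseek:.1f}x, standard: {standard_vs_deepseek:.1f}x")
# prints "fast mode: 57.5x, standard: 28.7x"
```

Even at standard rates, Opus 4.8 output tokens cost roughly 29 times as much as DeepSeek V4 Pro's.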

Anthropic's justification for the price differential is based on quality and safety. On SWE-bench Pro, Opus 4.8 outperforms both Chinese models. In terms of alignment, neither of the Chinese models approaches Anthropic's published standards.

These factors are significant in production settings where a model's passive compliance with harmful inputs poses a real threat—particularly in regulated sectors, legal contexts, and scenarios where "it seemed fine" is not an acceptable explanation after an incident. For others, the pricing gap is difficult to overlook.

Testing Outcomes

We ran a coding test, asking each model to build a 3D zombie game, to compare Claude Opus 4.8 with ChatGPT and DeepSeek, two of its most notable competitors from the U.S. and China. Opus 4.8 ran at its default High setting, and GPT-5.5 and DeepSeek V4 Pro ran at high effort: three models, one prompt, no retries.

GPT-5.5 completed the task first, though its game lacked zombie visuals and sound effects. It was quick, but it failed to meet the brief.

DeepSeek V4 Pro secured second place, delivering mouse movement, actual zombie characters, sound effects, reliable mechanics, and a polished aesthetic. No complaints here.

Opus 4.8 took about three times longer than GPT-5.5 but produced the best splash screen, superior zombie designs, the most effective game mechanics, and decent sound effects. It was the slowest, yet its output was the best. However, this might not justify its use over DeepSeek, given the cost difference.

All the games can be found on our Itch.io profile. GPT-5.5 generated Zombie Typing, Opus produced Typing Dead, and DeepSeek V4 Pro created an untitled game that drops players straight into the action. We’ll refer to it as TypeSeek.

A comprehensive review is forthcoming. For now, Claude Opus 4.8 outperforms both GPT-5.5 and Opus 4.7 in coding tasks while keeping the same pricing that Anthropic has offered since 4.7. Developers already spending $5 per million input tokens now get a superior model at no additional cost.
