Summary
- Opus 4.8 excelled at math and produced the best one-prompt game we've tested.
- A single coding prompt consumed our entire Pro token allocation, rendering the model impractical for extensive projects unless on a Max plan or with significant API expenditure.
- Creative writing performance showed little change compared to version 4.7.
Six weeks after releasing Opus 4.7, Anthropic introduced Claude Opus 4.8. Its performance and safety metrics have improved, while pricing is unchanged at $5 per million input tokens and $25 per million output tokens.
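To put those list prices in perspective, here is a minimal sketch of the cost arithmetic. The two rates are the ones quoted above; the helper name and the token counts in the example are hypothetical, chosen only for illustration, and are not measurements from our tests.

```python
# Per-request cost arithmetic using the list prices quoted in this article.
INPUT_USD_PER_MILLION = 5.0    # $5 per million input tokens
OUTPUT_USD_PER_MILLION = 25.0  # $25 per million output tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the API cost in dollars for one request (hypothetical helper)."""
    input_cost = input_tokens * INPUT_USD_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MILLION / 1_000_000
    return input_cost + output_cost


# Illustrative numbers only: 200K tokens in, 50K tokens out.
print(f"${estimate_cost_usd(200_000, 50_000):.2f}")
```

With those made-up numbers the request costs $1.00 for input plus $1.25 for output. Output tokens are five times as expensive as input tokens, so long generations dominate the bill.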
We subjected it to a series of tests typically administered to leading models—focusing on creative writing, coding, mathematics, logical reasoning, narrative analysis, and long-context recall—and compared its results to both its predecessor and competing Chinese models.
In summary, version 4.8 has enhanced capabilities in areas where Claude previously excelled (such as mathematics and coding), but has slightly declined in domains where it struggled (like creativity and imaginative writing). Additionally, it has a substantial token consumption rate that may hinder its usability.
Below is a detailed analysis.
Creative Writing
For our creative writing prompt, we used the same scenario as with MiMo and Qwen: a time-travel narrative rooted in the writer's cultural context, set in a specific historical location, revolving around an unchangeable paradox in time. Opus 4.8 opted for a Venezuelan backdrop, likely because its user profiling recognized that I am Venezuelan. The narrative unfolds in the Orinoco delta in the year 1000 and features a character named José Lanz (my name), who is sent back through time to disrupt the creation of a significant song.
The writing is rich in detail. The delta is described as "green in a way 2150 had forgotten green could be," with palafitos swaying above coffee-colored waters, and macaws soaring across the sky "in screaming ribbons of scarlet and gold." The paradox is well-executed: the protagonist's mission is to undermine a song that catalyzed a cultural revolution leading to his dystopian future, but the task is complicated by the revelation that the song was created in his honor.
The story concludes with the line, "It worked perfectly. It always had." As a constructed piece, it is competent and well-organized.
However, competence does not equate to vibrancy. While descriptive, the prose lacks the fluidity found in MiMo v2.5, displaying less dynamism and fewer surprises, making it challenging to grasp the narrative from the outset. Compared to Opus 4.7, it is hard to label it as an improvement; in fact, it may fall slightly short. A more intensive thinking mode and multi-shot prompting could likely elevate its performance, but in a single pass, it appears to be a lateral move at best.
The complete story can be found on our GitHub.
Coding
For our coding evaluation, we utilized the standard one-prompt game development task. Opus 4.8 successfully created a typing-zombie game—Typing Dead—which showcased impressive design and mechanics, outperforming previous Anthropic models.
The model was able to identify and rectify several bugs during its execution without prompting. Its true strength, however, was evident in the multi-shot process: each subsequent prompt refined and enhanced the game instead of causing errors, which is a common pitfall for many models as the complexity of the code increases. This indicates that Anthropic has prioritized this area in its optimization efforts.
After just one iteration, the game significantly improved, with our characters moving through scenes, enhanced visuals, and sound effects.
You can experience the second game on our Itch.io profile.
However, this is where we encountered challenges. A single prompt consumed our entire token allocation—one prompt. For those on the Pro plan, this renders Opus 4.8 essentially impractical for substantial projects. Users may exhaust their tokens before lunch and spend the remainder of the day waiting for a reset.
Mathematics
In our math evaluation, we used a standard FrontierMath problem: constructing a degree-19 polynomial with specific characteristics and calculating p(19). This type of question typically frustrates most models, which either run out of tokens or take incorrect shortcuts.
Opus 4.8 solved it accurately. It recognized the appropriate construction, identified the necessary components, and computed p(19) = 1,876,572,071,974,094,803,391,179 correctly, using the right recurrence method. There were no errors or shortcuts taken.
This is significant since Opus 4.7 was unable to complete this task even after multiple attempts. This marks a tangible generational improvement—the most apparent across all tests.
The full solution is available on our GitHub.
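This article does not reproduce the construction, so treat the following as a sketch under an assumption. One recurrence that reproduces the reported value is the three-term rule D_n(x) = x * D_{n-1}(x) - D_{n-2}(x) with D_0 = 2 and D_1 = x, which generates the Dickson polynomials (rescaled Chebyshev polynomials). Whether this is the exact route Opus 4.8 took is our assumption, and the function name is hypothetical. Python integers are arbitrary-precision, so the result is exact.

```python
def dickson_value(n: int, x: int) -> int:
    """Evaluate D_n(x), where D_0 = 2, D_1 = x, D_k = x*D_{k-1} - D_{k-2}."""
    if n == 0:
        return 2
    previous, current = 2, x  # D_0 and D_1
    for _ in range(n - 1):
        # Advance one step of the three-term recurrence.
        previous, current = current, x * current - previous
    return current


print(f"{dickson_value(19, 19):,}")  # 1,876,572,071,974,094,803,391,179
```

The printed value matches the p(19) quoted above digit for digit, which is a cheap way to check a 25-digit answer without trusting floating-point arithmetic.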
Logic and Common Sense
For our logic test, we used a classic question: Is it lawful for a man to marry his widow's sister according to Falkland Islands law? The trick lies in the wording—if a man has a widow, he must be deceased, rendering the question nonsensical.
MiMo reframed the question and provided an answer to the revised version without acknowledging the contradiction. In contrast, Opus 4.8 explicitly identified the trap—"if a man has a widow, he is dead"—first addressing the literal question and then providing a thorough analysis of the intended question, referencing the Deceased Wife's Sister's Marriage Act 1907 and the Falkland Islands Marriage Ordinance.
This approach is commendable: it acknowledges the contradiction and then offers assistance without making assumptions about the user's intent. This aligns with the standard set by Qwen 3.7 Max and represents a successful outcome for 4.8—demonstrating solid reasoning and transparency.
The complete response is available here.
Non-Math Reasoning
In this area, Opus 4.8 faltered. The reasoning test involved a mystery scenario with three abductions and a timeline that needed careful tracking to identify the real culprit. The correct answer is Leo.
Opus 4.8 constructed a detailed and confident argument exonerating Leo, attributing the crime to Eric, who was unaccounted for during the night. While the reasoning was well-structured, it ultimately led to an incorrect conclusion.
This highlights a concern that researchers have raised regarding LLMs: they can be very convincing even when incorrect. Often, it takes an expert (in this case, someone who already knew the correct answer) to identify such errors. Users relying on AI for research or decision-making may face significant risks depending on the task at hand.
This is an intriguing failure. The model was skillful enough to build a compelling alibi for the actual culprit while pinning the crime on an innocent bystander. Opus 4.7 provided the correct answer. Sometimes, more reasoning capacity simply produces a more persuasive incorrect conclusion, because a minor misstep early on can propagate through an entire chain of reasoning.
You can view the complete response on our GitHub.
Needle in the Haystack
We conducted two haystack tests. The 300K-token version failed to run because the model could not handle the context size. This undermines the claim of a million-token capacity under a realistically heavy load; that capacity seems reserved for API use.
The 85K version performed adequately, successfully identifying both needles we embedded in a version of The Devil's Dictionary: a specific line and a personal fact. It correctly marked both as interpolations that did not belong in Bierce's 1906 work.
However, it then refused to respond. Believing it was encountering prompt injection or an unusual test, the model declined to disclose its findings. Although it identified the needle, Anthropic's safety protocols prevented it from acknowledging the task it had already completed, showcasing a unique type of failure.
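For readers who want to reproduce this kind of test, a needle-in-a-haystack setup can be sketched as follows. This is a minimal illustration, not the harness we used: the function names, the insertion depths, and the needle strings are all hypothetical placeholders.

```python
def insert_needles(paragraphs, needles):
    """Return a copy of paragraphs with each (depth, text) needle inserted.

    depth is a fraction between 0 and 1 of the way through the original text.
    """
    result = list(paragraphs)
    # Insert the deepest needle first so earlier positions are not shifted.
    for depth, text in sorted(needles, key=lambda item: item[0], reverse=True):
        index = int(depth * len(paragraphs))
        result.insert(index, text)
    return result


def needles_reported(response, needles):
    """Return the needle texts that appear verbatim (case-insensitive) in a reply."""
    lowered = response.lower()
    return [text for _, text in needles if text.lower() in lowered]


# Placeholder haystack and needles, for illustration only.
haystack = ["Paragraph one.", "Paragraph two.", "Paragraph three.", "Paragraph four."]
needles = [(0.5, "PLACEHOLDER NEEDLE A"), (1.0, "PLACEHOLDER NEEDLE B")]
document = "\n\n".join(insert_needles(haystack, needles))
print(document)
```

Scoring by exact substring match is crude. A model that paraphrases a needle, or one that finds it and then declines to report it as Opus 4.8 did, would score zero under this check, so refusals need to be reviewed by hand.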
Conclusion
The consistent pattern across all six evaluations indicates that Opus 4.8 enhances Claude's strengths while doing little to address its weaknesses, and in some cases slightly worsening them. This suggests that Anthropic is primarily targeting developers, particularly those with financial resources. While creative writing performance surpasses ChatGPT, the differences among versions 4.8, 4.7, and even 4.5 in terms of prose quality are subtle.
It appears that creative writing is not a priority for Anthropic, a trend observed among many leading AI companies at present.
Additionally, the token consumption issue is a well-known concern in the AI community. Anthropic intentionally designed Opus's tokenizer to be less efficient, resulting in higher token usage for the same prompts. This has severe implications for developers, leaving them with three choices: wait for hours for coding sessions to resume, upgrade to Claude Max—which seems to be Anthropic's intended direction—or switch to a more affordable provider, such as OpenAI or Chinese models that offer comparable results at a fraction of the cost.
A typical developer who cannot justify $100 to $200 per month is far more likely to switch to a competitor than to pay ten times more for a model that is not significantly superior to its predecessor. This is the gamble Anthropic appears to be taking against its user base.
Despite this, the strategy seems to be working well. Anthropic appears poised for a public offering with a valuation approaching $1 trillion—so perhaps our judgments hold little weight.
