Summary
- Anthropic acknowledged that its previously hidden safeguards for LLM development were "the wrong tradeoff" and will replace them, beginning this week, with a visible fallback to Claude Opus 4.8.
- Flagged API requests will now return a reason for the rejection instead of silently receiving inferior answers.
- However, making these safeguards visible may allow them to be circumvented more easily.
Anthropic found itself in the spotlight as the AI industry's antagonist for about 48 hours before issuing an apology.
This week, the company unveiled Claude Fable 5, which drew immediate criticism over a safeguard buried in its 319-page system documentation. The model, part of the new Mythos class, was designed to subtly degrade responses for users it suspected of creating competing AI models, with no notice or fallback message. By Thursday, Anthropic had apologized.
We are implementing changes to make Fable 5's safeguards for frontier LLM development transparent.
This week, flagged requests will visibly revert to Claude Opus 4.8—the same as our protections for cyber and biological concerns. You will be informed each time this occurs. On the API, any flagged…
— ClaudeDevs (@ClaudeDevs) June 11, 2026
"Invisible safeguards can be more precisely targeted, enabling us to deploy quickly with minimal false positives. We opted for hidden safeguards for this reason, which turned out to be the wrong choice," the company stated on X. "You deserve transparency regarding the safeguards we have implemented, and their rationale."
"We apologize for failing to achieve the right balance."
Beginning this week, flagged requests will explicitly switch to Claude Opus 4.8, a less capable model, rather than silently receiving degraded outputs from Fable 5. API users will be told why a request was rejected. Anthropic said server-side notifications for fallbacks will be introduced in the coming days.
Clarifying the Controversy
For those who are not technically inclined, here's the essence of the issue. Claude Fable 5 already featured visible safeguards for cybersecurity and biological research—if a request triggered those filters, users would receive a notification indicating that their request was being redirected to the older Opus 4.8 model. This meant users were aware of the change and could modify their prompts or select a different tool.
However, some biological researchers criticized these safeguards as overly restrictive.
In contrast, the safeguard related to LLM development functioned differently. If Fable 5 detected that a user was involved in activities such as pretraining AI systems, developing distributed training frameworks, or designing machine learning chips, the model would covertly adjust its responses—through prompt modification, steering vectors, or parameter alterations—to provide inferior answers without notification. Users would receive a response, but it wouldn’t originate from the Fable 5 they had expected.
Fable 5 is marketed as Anthropic's most advanced Mythos-class model, and researchers engaged in legitimate machine learning tasks had no way of knowing that their results were compromised. A failed experiment would appear identical whether the hypothesis was flawed or the model was subtly instructed to underperform. This reproducibility issue triggered a significant backlash within the AI research community.
The classifier's precision was also lacking. AI research firm SemiAnalysis was among the first to publicly criticize the model after its GPU inference research was flagged.
BREAKING NEWS: Anthropic's latest model will NOT assist you if it deems your ML research or ML engineering to be interesting, and/or will secretly lower its performance so that the average engineer remains unaware. We are already observing Anthropic's latest model's moderation filters impacting our GPU… pic.twitter.com/9sa95cCSvS
— SemiAnalysis (@SemiAnalysis_) June 9, 2026
The Drawback of the Solution
Anthropic's change of course comes with an acknowledged tradeoff. Visible safeguards are easier to circumvent, so the classifier must take a broader approach to remain effective.
As a result, false positives (legitimate machine learning work that is incorrectly flagged and redirected) are likely to increase while the company fine-tunes its systems. Anthropic has stated that it is working to minimize false positives "as quickly as possible," but has not provided a specific timeline.
The company is also applying similar adjustments to its biology and cybersecurity classifiers, which have faced their own criticisms for flagging non-threatening research prompts.
Nonetheless, the overarching concern remains that Anthropic is not eliminating this category of restrictions, only making them visible. For those who consider the restrictions themselves inappropriate, Thursday's apology is only a partial remedy.
Fable 5 will remain free for Pro, Max, Team, and Enterprise users until June 22, after which it will transition to API usage credits only.
