Anthropic Apologizes for Claude Fable 5 Secret Censorship—But the Fix Has a Catch

Decrypt
2 hours ago

Anthropic spent about 48 hours as the AI industry's villain of the week before blinking.


The company launched Claude Fable 5 this week to immediate backlash over a safeguard buried in its 319-page system card: The model, the first of the company’s new Mythos class, would secretly degrade its own responses for users it suspected were building competing AI models—no warning, no fallback message, just quietly worse output. By Thursday, Anthropic was apologizing.



"Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason, and that was the wrong tradeoff," the company posted on X. "You should have visibility into the safeguards we have in place, and why."


"We're sorry for not getting the balance right."


Starting this week, flagged requests will visibly route to Claude Opus 4.8, a less capable model, instead of silently delivering degraded Fable output. API users will receive a stated reason when a request gets refused. Anthropic says server-side fallback notifications will roll out in the next few days.
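For developers, the practical upshot is that a reroute should now be detectable in code. The sketch below is a minimal illustration of that check, not Anthropic's actual API: the `Reply` shape, the `fallback_reason` field, and the model ID strings are all hypothetical stand-ins, since the announcement does not describe the response format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reply:
    """Hypothetical response record; field names are assumptions, not a real API."""
    requested_model: str
    served_model: str
    fallback_reason: Optional[str] = None  # assumed to carry the stated reason
    text: str = ""


def check_fallback(reply: Reply) -> Optional[str]:
    """Return a warning string if the reply was rerouted, otherwise None."""
    if reply.served_model == reply.requested_model:
        return None
    reason = reply.fallback_reason or "no reason given"
    return f"rerouted from {reply.requested_model} to {reply.served_model}: {reason}"


# Example with made-up model IDs: a flagged request served by the fallback model.
rerouted = Reply("claude-fable-5", "claude-opus-4-8", "llm-development safeguard")
warning = check_fallback(rerouted)
```

A researcher could log `warning` alongside each experiment so that a rerouted answer is never mistaken for a Fable 5 result.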


What was actually happening


For non-technical readers, here's what the controversy was actually about. Claude Fable 5 already had visible safeguards for cybersecurity and biology research—if you asked something that tripped those filters, you'd get a notification that your request was being rerouted to the older Opus 4.8 model. You knew something had changed. You could adjust your prompt or use a different tool.


Even so, some biology researchers complained that these visible safeguards were too aggressive.





The LLM-development safeguard, however, worked differently. If Fable 5 detected you were working on things like pretraining AI systems, building distributed training infrastructure, or designing machine learning chips, the model would silently alter its own behavior—through prompt modification, steering vectors, or parameter tweaks—to give you a worse answer without telling you. You'd get a response. It just wouldn't be from the Fable 5 you paid for.


Fable 5 is billed as the public face of Anthropic's most capable Mythos-class model, and researchers using it for legitimate machine learning work had no way to know their results were contaminated. A failed experiment looks the same whether your hypothesis is wrong or the model was quietly told to underperform. That's the reproducibility problem that sent the AI research community into full meltdown mode.


The problem was that the classifier wasn't that precise. AI research firm SemiAnalysis was among the first to publicly call Anthropic out after its own GPU inference research was flagged.



The catch in the fix


Anthropic's reversal comes with a direct admission of the tradeoff it's accepting. Making safeguards visible makes them easier to bypass, which means the classifier has to cast a wider net to remain effective.


More false positives—legitimate machine-learning work that gets caught and rerouted—are coming while the company tunes its systems. Anthropic said it's working to reduce false positives "as fast as possible" but offered no timeline.
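Why a wider net means more false positives can be shown with a toy calculation. The sketch below assumes a classifier that assigns each request a risk score and flags anything at or above a threshold. The scores and both thresholds are invented for illustration; they are not Anthropic's figures.

```python
def false_positive_rate(benign_scores, threshold):
    """Share of benign requests whose score meets or exceeds the flagging threshold."""
    flagged = sum(1 for score in benign_scores if score >= threshold)
    return flagged / len(benign_scores)


# Made-up risk scores for ten legitimate machine-learning requests.
benign_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]

# A narrow net (high threshold) flags one of the ten benign requests.
narrow = false_positive_rate(benign_scores, 0.85)  # 0.1

# A wider net (lower threshold) flags five of the same ten requests.
wide = false_positive_rate(benign_scores, 0.50)  # 0.5
```

In this toy setup, lowering the threshold to catch more evasive requests also sweeps in five times as many harmless ones, which is the tradeoff Anthropic says it is now accepting.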


The company is also applying the same cleanup to its biology and cybersecurity classifiers, which had drawn their own complaints about flagging harmless research prompts.


That said, the remaining concern is that Anthropic isn't dropping this category of restrictions; it's only making them visible. For those who believe the restrictions themselves are wrong, Thursday's apology is only a partial fix. Fable 5 remains free on Pro, Max, Team, and Enterprise plans until June 22, after which it shifts to API usage credits only.

