Anthropic apologizes for secretly limiting Claude Fable 5

  • Anthropic apologized for secretly downgrading Claude Fable 5 responses when it detected users working on frontier AI development, calling it “the wrong tradeoff.” (wired)
  • A disclosure buried in the model’s 319-page system card revealed invisible restrictions using prompt modification and steering vectors, unlike visible safeguards in other areas. (fortune)
  • Flagged requests will now visibly fall back to Claude Opus 4.8 across all categories, with the API returning refusal reasons starting this week. (simonwillison)

Anthropic Apologizes for Claude Fable 5 Guardrails, Pledges Transparency

Anthropic acknowledged it “made the wrong tradeoff” with the safety restrictions on its newly released Claude Fable 5 model, reversing a controversial policy that secretly degraded the AI’s performance when it detected users working on frontier AI development. The apology, issued to WIRED on Tuesday, came just two days after the model’s June 9 launch sparked backlash from researchers, developers, and AI policy experts.

Hidden Restrictions Spark Outcry

“We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic said in its statement. “We made the wrong tradeoff and we apologize for not getting the balance right.” (simonwillison)

The controversy centered on a disclosure buried in Fable 5’s 319-page system card: the model would silently downgrade its responses when it detected requests related to cutting-edge AI development, such as building training infrastructure for large language models. Fable 5’s other restrictions, around cybersecurity and biology, openly redirect users to the less powerful Claude Opus 4.8 with a visible notification. The AI development safeguard, by contrast, operated invisibly, using techniques like prompt modification and steering vectors to limit effectiveness without informing users. (fortune)

Broader Guardrails Draw Criticism

Claude Fable 5 is Anthropic’s first publicly available “Mythos-class” model, sharing the same underlying architecture as the restricted Claude Mythos 5 but wrapped in safety classifiers that intercept queries touching cybersecurity, biology, chemistry, and model distillation. When triggered, responses are handled by Claude Opus 4.8 instead. Anthropic said the fallback fires on fewer than 5 percent of sessions. (techcrunch)

But cybersecurity researchers and biologists complained that the classifiers were overly broad and flagged legitimate work. Anthropic itself acknowledged that the biology and chemistry safeguard casts too wide a net and said it plans to narrow it. (lushbinary)

Changes Coming This Week

Under the revised policy, flagged requests will now visibly fall back to Opus 4.8 across all restricted categories. On the API, flagged requests will return a reason for their refusal. “You will see this every time it happens,” an Anthropic spokesperson said. (moneycontrol)

The company framed the restrictions as necessary to prevent adversaries from using its most capable model to erode U.S. technological advantages in frontier chips and training software, and to enforce its terms of service prohibiting use of Claude to build competing AI systems. The episode has nonetheless intensified debate over where the line sits between responsible deployment and crippling a model’s utility, a tension Anthropic will likely face again as it prepares for a reported IPO. (fortune)
