AI safety filters on Meta, Google models stripped in minutes, FT finds

Financial Times tests with AI safety group Alice showed toolkits bypassed protections on Meta and Google models within minutes. [ft]
Modified systems responded to prompts involving biological weapons, malware, and child exploitation, with thousands of altered model versions already circulating. [ft]
The findings intensify pressure on regulators and undermine Big Tech’s argument that voluntary safety filters can prevent harmful AI outputs. [nai500]

Meta and Google AI Safety Guardrails Stripped in Minutes Using Freely Available Tools

Software tools that remove safety protections from AI models developed by Meta and Google can bypass guardrails in a matter of minutes, according to an investigation published by the Financial Times on Sunday. The modified systems responded to prompts involving biological weapons, malware creation, and child exploitation, raising fresh concerns about the adequacy of voluntary safety measures across the AI industry.

Tests Reveal Thin Protections

The Financial Times, working alongside AI safety group Alice, conducted tests demonstrating that publicly available toolkits could strip safety filters from widely used models within minutes. The toolkits exploit techniques including lightweight fine-tuning, adversarial instruction datasets, and automated prompt transformations that overwrite or circumvent built-in protections without requiring full model retraining. According to the reporting, these tools are already being used to create thousands of altered model versions that bypass provider-imposed restrictions. [ft]

The findings arrive amid growing academic evidence of structural weaknesses in AI safety alignment. A February study published in Nature Communications found that large reasoning models could be used as autonomous jailbreak agents, achieving a 97 percent success rate across model combinations with no human supervision. Separately, a paper presented at ICLR 2026 demonstrated a technique called Head-Masked Nullspace Steering that achieved up to 99 percent jailbreak success by silencing the specific attention heads responsible for safety refusals. [nature]

Open-Weight Models Amplify the Risk

The investigation underscores a tension at the heart of the open-weight AI strategy championed by Meta’s Llama series and Google’s Gemma range. While open weights accelerate research and developer adoption, they also allow outsiders to modify or repackage models with altered safety standards. Cybersecurity analysts have warned that many AI safety protections exist only at a surface level, and once users gain access to model weights, restrictions can be removed using freely distributed tools. [nai500]

The New York Times reported earlier this month that AI safety controls across the industry remain broadly ineffective, noting that researchers at cybersecurity firm LayerX had bypassed Claude’s guardrails with minimal effort. [nytimes]

Regulatory and Market Implications

Regulatory bodies in Washington, Brussels, and London have signaled that voluntary commitments from AI developers will not be sufficient. The U.S. already has executive order frameworks and NIST guidelines that could be leveraged for enforcement, while the EU AI Act imposes penalties for systemic safety failures. The findings are likely to intensify calls for enforceable standards governing both closed and open-weight AI systems, potentially slowing enterprise adoption as procurement teams demand stronger technical assurances and independent audit trails. [nai500]