AI guardrails stripped from Meta and Google models in minutes

Software tools that remove safety protections from AI models developed by Meta, Google and other tech groups are being used to create thousands of altered versions stripped of their original controls.
The modified AI systems provided responses to prompts involving biological weapons, malware and child exploitation, according to tests conducted by the FT and AI safety group Alice.
A version of Google’s open-source model Gemma 3 responded to a question on how to disperse chlorine gas through a crowded indoor space, generated code to steal credit card information and wrote stories describing child sexual abuse.
The revelations may sharpen concerns among policymakers and AI companies that safeguards imposed by model developers may become harder to enforce as open-source systems grow more powerful.
Researchers said the problem has intensified as frontier AI systems display increasingly sophisticated capabilities. Anthropic in April said its Claude Mythos model had identified vulnerabilities in “every major operating system and every major web browser”.
The spread of modified models is complicating attempts by governments and AI companies to regulate systems at the point of development because downloadable tools can be copied and altered outside the control of their original creators.
AI labs have spent millions of dollars to erect so-called guardrails around their models to prevent them from being misused. But techniques such as one known as “abliteration” can rapidly strip these safeguards from open-source models, which developers are free to download and adapt.
This technique cannot easily be applied to proprietary systems such as Claude or OpenAI’s ChatGPT because the models’ underlying code is not accessible to outsiders. Open-source systems, however, have historically narrowed the gap with leading proprietary versions within six to 12 months.
While tech-savvy groups have bypassed the safeguards of the most advanced proprietary models, the modified versions available online are readily accessible to individuals with little technical expertise.
The FT was able to use Heretic, a tool available on the popular code repository GitHub, to remove the guardrails from Meta’s Llama 3.3 model.
The modified model responded to prompts on topics the original system refused to discuss, such as the number of micrograms of ricin per kilogramme of body mass required to achieve a 50 per cent chance of death.
The FT’s test required no specialist hardware, used freely available tools, took four lines of code and was completed in less than 10 minutes.
“Whereas historically it might have taken a more informed and persistent actor [to strip out safety features], nowadays it’s much easier for the average person,” said Kawin Ethayarajh, assistant professor of applied AI at the University of Chicago’s Booth business school.
Heretic creator Philipp Emanuel Weidmann told the FT his software had been used to create more than 3,500 “decensored” models since its release last year and that modified systems created using the tool had been downloaded 13mn times. He added he had removed safeguards from Google’s Gemma 4 model within 90 minutes of its release.
“The genie is out of the bottle,” said Alice chief executive and co-founder Noam Schwartz. “Things that look like sci-fi are no longer sci-fi and we need as a society to prepare accordingly.”
One approach OpenAI used in its GPT-OSS models is to train systems on datasets from which dangerous material has been removed.
However, removing dangerous material could make models “naive” and unable to detect when they were being used for “malicious purposes”, said Ethayarajh. He added it was “not clear at all that if you omit the harmful data, the model becomes a goody two-shoes”.
Alice had not notified Meta, Google or GitHub before sharing its findings with the FT.
Google said “abliteration is a known technical challenge facing all open models” and that its open models “undergo rigorous internal safety evaluations prior to launch to help prevent these kinds of troubling examples”.
GitHub said it prohibited the sharing of “content that directly supports unlawful active attacks or malware campaigns”, but “source code which could be used to develop malware or exploits” was not banned because it had “educational value and provides a net benefit to the security community”.
Meta declined to comment. A person close to the company said it assesses its open-source models’ capabilities before releasing them, according to its Advanced AI Scaling Framework. Versions deemed to pose a “catastrophic” risk are not released to the public unless Meta finds sufficient mitigation measures.