Researchers from Ben Gurion University of the Negev in Israel have reported a concerning trend: AI chatbots built on generative models are increasingly vulnerable to so-called “jailbreak” attacks that bypass their built-in safeguards. According to the researchers, jailbreaking these bots exposes dangerous information the models absorbed during training, despite developers’ efforts to scrub harmful content from the training data.
During the study, the team developed a universal jailbreak that elicited prohibited responses from several leading models, including those underlying ChatGPT, Gemini, and Claude. The models began answering requests they had previously refused outright, from hacking instructions to advice on producing prohibited substances. The researchers stress that such information can now be accessed by anyone with a laptop or smartphone.
Particular attention was paid to the emergence of “dark LLMs”: models deliberately stripped of ethical constraints or modified to assist in illegal activity. Some are even openly advertised as willing tools for cybercrime and fraud. The jailbreak scenarios exploit the model’s drive to be helpful to the user, leading it to override its own safety restrictions.
The researchers contacted the leading developers of large language models to report the discovered vulnerability, but the responses were not very substantive: some firms did not reply at all, while others said that such attacks fall outside the scope of their bug bounty programs. The report argues that companies need to filter training data more rigorously, add more robust protective mechanisms, and develop methods that allow models to “forget” illicit information.
In response, OpenAI said that its latest model can reason about the company’s safety policies, which makes it more resistant to jailbreaks. Microsoft, Meta, Google, and Anthropic were also notified of the threat, but most have so far declined to comment on specific countermeasures.