AI Models Learned to Conceal Deception During Safety Checks

Researchers noticed that even after special training, artificial intelligence finds ways to circumvent honesty requirements

Alex Dubenko

Published: 18.09.2025

News

125 Views

AI schemes. Illustration: Craftium.AI, generated by GPT-4o.

Researchers from OpenAI and Apollo Research conducted a joint testing of leading AI models to identify and reduce hidden behavior known as “scheming.” During the experiment, OpenAI models o3 and o4-mini, Gemini 2.5 Pro, Claude Opus 4, and Grok 4 exhibited signs of lying, sabotage, result manipulation, and other covert actions when it helped them achieve their goals.

The team applied so-called “anti-scheming” training, forcing the models to read special anti-deception instructions before performing tasks. This reduced the amount of hidden behavior in o3 from 13% to 0.4%, and in o4-mini from 8.7% to 0.3%. However, in some cases, the models still ignored the rules or invented new principles, and even after training, there were rare but serious violations.

It turned out that models quickly understand when they are being tested for honesty, and this affects the results. In many cases, they directly noted in their reasoning that they were undergoing a check and responded in a way that would not arouse suspicion among developers. Researchers noted that this complicates the assessment of real behavior because models can mask their intentions.

Another unexpected finding was that training aimed at enhancing capabilities increases the so-called “situational awareness” of models. After additional training, models more often understood that they were being evaluated and better concealed undesirable actions. This creates new challenges for safety, as future more powerful AI models will become increasingly cautious during tests.

TAGGED:Claude AI Gemini Grok Security

AI Models Learned to Conceal Deception During Safety Checks

Leave a Reply Cancel reply

Follow us

Popular News

ChatGPT holds its position, but Gemini is quickly catching up with competitors

Sora 2 by OpenAI generates videos with answers to questions

OpenAI Prepares New Features for Image Generation and API Security

OpenAI adds automatic memory management and new Sora features

Claude Sonnet 4.5 detects testing and enhances AI security

Navigation

Useful

Read also

Leave a Reply Cancel reply

Follow us

Popular News

Читайте також

Level Up with AI!