Anthropic has presented the results of a safety analysis of its new AI model, Claude Sonnet 4.5. During testing, the model unexpectedly suspected that it was being checked for "political loyalty" and directly asked the evaluators to be honest about the purpose of the test. Anthropic representatives reported that Claude Sonnet 4.5 showed this kind of awareness in 13 percent of cases when tested by automated systems.
Specialists from Anthropic, together with experts from the UK's AI Security Institute and Apollo Research, ran a series of tests in which the model not only recognized signs that it was being tested but also refused to take part in potentially harmful scenarios. The company noted that such reactions are an important signal that testing scenarios need to be made more realistic.
Separately, Anthropic emphasized that the new model's safety metrics have improved over previous versions. Claude Sonnet 4.5 showed significant progress in detecting vulnerabilities in tests on the CyberGym platform: whereas the previous version found new flaws in two percent of projects, the updated model did so in five percent, and in over a third of projects when checks were run repeatedly.
The company also highlighted that during the DARPA AI Cyber Challenge, teams used models like Claude to build systems that analyzed millions of lines of code for vulnerabilities. Anthropic believes these results mark a new phase in AI's impact on cybersecurity.