Anthropic has introduced a new feature that allows its latest and largest AI models to end conversations in rare, extreme cases of persistently harmful or abusive interactions with users. The company emphasizes that the feature is intended not to protect the user, but to protect the AI model itself. It applies to the Claude Opus 4 and 4.1 models and is triggered only when users request sexual content involving minors or attempt to obtain information that could enable large-scale violence or acts of terror.
Anthropic notes that during testing, Claude Opus 4 showed a strong reluctance to respond to such requests and a clear preference not to continue these conversations. The conversation-ending feature activates only as a last resort: after several attempts to redirect the conversation have failed and there is no remaining hope of a productive exchange, or when the user explicitly asks to end the chat.
The company reports that Claude will not use this feature when there is a risk that the user may harm themselves or others. After a conversation is ended, users can still start a new dialogue from the same account, or create a new branch of the problematic conversation by editing their earlier messages.
Anthropic describes the feature as an ongoing experiment and plans to keep refining its approach. The company is also studying the broader question of "model welfare" and testing various ways to reduce potential risks to its AI models in the future.