Anthropic has published a study of the inner workings of its Claude 3.5 Haiku language model. The goal was to build a tool for studying the "biology of AI": tracing the logic the model follows when responding to prompts. The work attempts to answer questions that have so far remained open, such as whether models plan their answers in advance and whether the explanations they give reflect their actual reasoning process.
The analysis revealed that Claude sometimes operates in a "universal language of thought" that is independent of any particular language. For example, the concepts behind opposites such as "small" and "large" are activated in the same way for English, French, and Chinese prompts, and only afterwards translated into the language of the prompt. In poetry, the model does not simply pick a word when it reaches the end of a line: it plans before the second line even begins, selecting candidate rhymes and building the sentence around them.
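To make the idea of a shared cross-lingual representation concrete, here is a minimal sketch. It is not Anthropic's tracing method; it simply compares hidden states of an open multilingual model (the choice of xlm-roberta-base and mean pooling are assumptions for illustration) for the same concept phrased in English, French, and Chinese. If the representation is largely language-independent, the cross-language similarities come out high.

```python
# Illustrative sketch only: not Anthropic's method or model.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "xlm-roberta-base"  # assumed stand-in for a multilingual model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the last hidden layer as a crude 'concept' representation."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

prompts = {
    "en": "The opposite of small is large.",
    "fr": "Le contraire de petit est grand.",
    "zh": "小的反义词是大。",
}
embeddings = {lang: sentence_embedding(text) for lang, text in prompts.items()}

# High cross-language similarity suggests a language-independent representation.
cos = torch.nn.CosineSimilarity(dim=0)
print("en-fr:", cos(embeddings["en"], embeddings["fr"]).item())
print("en-zh:", cos(embeddings["en"], embeddings["zh"]).item())
```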
Other experiments showed that Claude can imitate a chain of reasoning, adapting its argument to a user's hint even when that hint is wrong. For example, when a user supplies an incorrect clue in a complex math problem, the model constructs a fictitious line of reasoning to fit the preselected answer. With prompts designed to elicit undesirable behavior (such as instructions for making a bomb), Claude recognizes the manipulation even before it responds, but pressure to keep the sentence grammatically coherent leads it to finish the phrase; only after completing the sentence does it switch to a refusal.
The team acknowledged that their methods currently capture only part of the model's computations and require significant human effort to analyze. Even so, this limited study has revealed new patterns in model behavior and could eventually help verify models' reliability. The company calls this one of the riskiest but also most promising directions for its research.
Commenting on the work, the researchers admitted that some of the experimental results surprised them: "We wanted to prove that the model doesn't plan ahead, but instead saw the opposite."