Researchers from Apple have released a paper that once again highlights the limitations of modern generative AI models. The study focused on large reasoning models tasked with complex problems, such as logical puzzles like the “Tower of Hanoi” and “River Crossing.” The results were unexpectedly stark: on genuinely complex tasks the models’ accuracy collapsed completely, even when they were handed ready-made algorithms for solving them.
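For context, the Tower of Hanoi has a well-known recursive solution, and a sketch of it helps show what “ready-made algorithm” means here. The Python below is an illustrative reconstruction, not the exact prompt Apple’s team used; the function and variable names are this article’s own.

```python
def hanoi(n, source, target, auxiliary, moves):
    """Recursively move n disks from the source peg to the target peg."""
    if n == 0:
        return
    # First move the top n-1 disks out of the way, onto the auxiliary peg.
    hanoi(n - 1, source, auxiliary, target, moves)
    # Move the largest remaining disk directly to the target peg.
    moves.append((source, target))
    # Finally move the n-1 disks from the auxiliary peg onto the target peg.
    hanoi(n - 1, auxiliary, target, source, moves)

moves = []
hanoi(10, "A", "C", "B", moves)
print(len(moves))  # 2**10 - 1 = 1023 moves for 10 disks
```

A solution for n disks always takes 2^n − 1 moves, so the required sequence grows exponentially with the puzzle size, which is what made it a convenient dial for the researchers to turn up task complexity.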
The research showed that standard models handle simple tasks confidently, and large reasoning models can additionally break a problem down into steps. However, as soon as complexity increased, both types of model lost the ability to find correct solutions. Particularly surprising was that, instead of trying harder on difficult problems, the models actually cut back their reasoning effort, a phenomenon the Apple researchers called “particularly troubling.”
The tests covered models from leading companies, including OpenAI, Google, Anthropic, and DeepSeek. The researchers emphasized that the loss of accuracy on complex tasks occurred regardless of vendor and architecture. Moreover, on simple problems the models often found the correct answer early and then kept spending compute exploring incorrect alternatives, while as complexity increased they worked through wrong options before accidentally landing on the right one.
Apple’s findings send a strong signal to the entire industry: the study argues that current approaches to AI development have likely run into fundamental limits. Experts noted that these results challenge established assumptions about the capabilities of generative models and cast doubt on the prospects of achieving full artificial general intelligence with current technology.