New AI models o3 and o4-mini often make mistakes

Independent testing has shown that these reasoning models often invent actions and generate false information in their responses

Alex Dubenko

Published: 22.04.2025

News

177 Views

Illustrative image

OpenAI has introduced new generative AI models — o3 and o4-mini, which have already attracted attention with their unexpected test results. According to the company , these models offer the highest performance among their predecessors, but research has shown that they also generate false statements more frequently. According to the official report, o4-mini made mistakes in forty-eight percent of its responses — three times more than o1. The o3 model, despite better accuracy, still generated false information in a third of cases, twice as often as o1.

What is particularly intriguing is that o3 and o4-mini belong to the so-called reasoning models, which openly display their logic to the user. However, the independent Transluce laboratory noticed that o3 often invents actions it technically cannot perform, such as simulating code execution in a programming environment. Moreover, when a user questions such a response, the model persistently tries to justify the invented actions, even claiming to use an external computer for calculations.

Transluce noted that false statements about code execution appear more frequently in the o-series models than in the GPT series. Researchers pointed out that the increased level of fabrication in reasoning models may be related to certain design decisions, in particular the use of outcome-based reinforcement learning and the refusal to retain chains of reasoning from previous dialogues.

At the same time, it became known that OpenAI has significantly reduced the scope of safety testing for new models, including o3. Although the jailbreak protection system remains almost at the o1 level, the high rates of fabrication are surprising even to experts. The company emphasizes that fact-checking remains the user’s responsibility — especially when it comes to the latest reasoning models.

TAGGED:OpenAI Testing

New AI models o3 and o4-mini often make mistakes

Leave a Reply Cancel reply

Follow us

Popular News

Grok received new features for creating images and videos

Google Showcases First AI-Created TV Commercial

Google Gemini Leads in AI Image Creation

ElevenLabs launched a platform for licensed celebrity voices

Google to Release Gemini 3 and Nano Banana Pro This November

Navigation

Useful

Read also

Leave a Reply Cancel reply

Follow us

Popular News

Читайте також

Level Up with AI!