Artificial Analysis has published results from its new AA-Omniscience benchmark, which revealed striking accuracy problems in modern large language models. Of the 40 systems evaluated, only four achieved a positive score, and Google's Gemini 3 Pro topped the ranking with 13 points on the Omniscience Index. For comparison, the closest competitor, Claude 4.1 Opus, scored 4.8 points, while Grok 4, previously considered the most accurate, trailed by 14 points.
Gemini 3 Pro showed a significant accuracy advantage for the first time, answering 53 percent of questions correctly. However, the researchers noted that even the leaders of the ranking exhibit an extremely high rate of “hallucinations”, that is, confident but incorrect answers. For Gemini 3 Pro this figure reached 88 percent, in line with previous versions, and it also remains high for Grok 4 and GPT‑5.1, at 64 and 81 percent respectively.
The AA-Omniscience benchmark covers 6,000 questions across 42 categories in six key areas: business, humanities and social sciences, medicine, law, software engineering, and science and mathematics. The questions are based on authoritative sources and generated automatically by an AI agent. The evaluation index penalizes mistakes and rewards correct answers equally, encouraging models to abstain rather than guess and discouraging artificial confidence.
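To make the scoring concrete, here is a minimal Python sketch of how such an index could be computed. The +1/−1/0 scoring, the abstention-based hallucination-rate formula, and the example counts are assumptions for illustration, not published details of the benchmark:

```python
from dataclasses import dataclass

@dataclass
class EvalCounts:
    """Tallies for one model over the full question set."""
    correct: int
    incorrect: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained

def omniscience_index(c: EvalCounts) -> float:
    # Correct answers score +1, incorrect answers -1, abstentions 0;
    # the result is expressed in points per 100 questions.
    return 100.0 * (c.correct - c.incorrect) / c.total

def hallucination_rate(c: EvalCounts) -> float:
    # One plausible reading of "confident but incorrect answers":
    # how often the model guesses wrong rather than abstaining
    # on questions it does not answer correctly.
    not_correct = c.incorrect + c.abstained
    return 100.0 * c.incorrect / not_correct if not_correct else 0.0

# Hypothetical counts, loosely patterned on the reported Gemini 3 Pro
# figures (53 percent correct, an index of about 13).
counts = EvalCounts(correct=3180, incorrect=2400, abstained=420)
print(f"Omniscience Index: {omniscience_index(counts):.1f}")    # 13.0
print(f"Hallucination rate: {hallucination_rate(counts):.1f}%")  # 85.1%
```

Under a scheme like this, a model that guesses on every question it cannot answer drives its index toward zero or below, which is exactly the behavior the index is meant to discourage.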
The study showed that none of the models delivers consistent accuracy across all six areas. Claude 4.1 Opus leads in law and software engineering, GPT‑5.1 answers business questions best, and Grok 4 excels in medicine and science. At the same time, even large models like Gemini 3 Pro show high “hallucination” rates.
Artificial Analysis emphasized that although model size often correlates with accuracy, it does not guarantee fewer confidently wrong answers. Several compact models, including Nemotron Nano 9B V2, outperformed larger competitors thanks to greater reliability. To support further research, the team released 10 percent of the questions publicly, keeping the rest private.

