Artificial Analysis has published results from its new AA-Omniscience benchmark, which revealed striking accuracy problems in modern large language models. Of the 40 systems evaluated, only four achieved a positive score, and Google's Gemini 3 Pro topped the ranking with 13 points on the Omniscience Index. For comparison, the closest competitor, Claude 4.1 Opus, scored 4.8 points, while Grok 4, previously considered the most accurate, trailed by 14 points.
Gemini 3 Pro showed a significant accuracy advantage for the first time, answering 53 percent of questions correctly. However, the researchers noted that even the leaders of the ranking exhibit an extremely high rate of “hallucinations”, that is, confident but incorrect answers. For Gemini 3 Pro this figure reached 88 percent, in line with previous versions, and it also remains high for Grok 4 and GPT‑5.1, at 64 and 81 percent respectively.
The AA-Omniscience benchmark covers 6,000 questions across 42 categories in six key areas: business, humanities and social sciences, medicine, law, software engineering, and science and mathematics. The questions are based on authoritative sources and generated automatically by an AI agent. The evaluation index penalizes mistakes and rewards correct answers equally, encouraging models to abstain rather than guess and discouraging artificial confidence.
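To make the scoring concrete, here is a minimal Python sketch of how such an index could be computed. The +1/−1/0 scoring, the abstention-based hallucination-rate formula, and the example counts are assumptions for illustration, not published details of the benchmark:

```python
from dataclasses import dataclass

@dataclass
class EvalCounts:
    """Tallies for one model over the full question set."""
    correct: int
    incorrect: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained

def omniscience_index(c: EvalCounts) -> float:
    # Correct answers score +1, incorrect answers -1, abstentions 0;
    # the result is expressed in points per 100 questions.
    return 100.0 * (c.correct - c.incorrect) / c.total

def hallucination_rate(c: EvalCounts) -> float:
    # One plausible reading of "confident but incorrect answers":
    # how often the model guesses wrong rather than abstaining
    # on questions it does not answer correctly.
    not_correct = c.incorrect + c.abstained
    return 100.0 * c.incorrect / not_correct if not_correct else 0.0

# Hypothetical counts, loosely patterned on the reported Gemini 3 Pro
# figures (53 percent correct, an index of about 13).
counts = EvalCounts(correct=3180, incorrect=2400, abstained=420)
print(f"Omniscience Index: {omniscience_index(counts):.1f}")    # 13.0
print(f"Hallucination rate: {hallucination_rate(counts):.1f}%")  # 85.1%
```

Under a scheme like this, a model that guesses on every question it cannot answer drives its index toward zero or below, which is exactly the behavior the index is meant to discourage.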
The study showed that none of the models delivers consistent accuracy across all six areas. Claude 4.1 Opus leads in law and software engineering, GPT‑5.1 answers business questions best, and Grok 4 excels in medicine and science. At the same time, even large models like Gemini 3 Pro show high “hallucination” rates.
Artificial Analysis emphasized that although model size often correlates with accuracy, it does not guarantee fewer confidently wrong answers. Several compact models, including Nemotron Nano 9B V2, outperformed larger competitors thanks to greater reliability. To support further research, the team released 10 percent of the questions publicly, keeping the rest private.

