New research has uncovered weaknesses in generative AI when answering complex historical questions. A team of researchers tested three leading models (OpenAI's GPT-4, Meta's Llama, and Google's Gemini) on historical questions using the new Hist-LLM benchmark, which is built on data from the Seshat global historical database. The results, presented at the NeurIPS conference, showed that even the best-performing model, GPT-4 Turbo, achieved only 46% accuracy.
Researchers from the Complexity Science Hub in Austria noted that AI models handle basic facts well but lack the depth needed to answer more complex questions that require a nuanced understanding of history. For example, GPT-4 Turbo incorrectly claimed that scale armor existed in Ancient Egypt, even though that technology appeared there roughly 1,500 years later. Such mistakes may stem from the models' tendency to rely on well-known data while overlooking more obscure facts.
The study also found that the OpenAI and Llama models performed worse on questions about certain regions, such as sub-Saharan Africa, which may point to biases in the training data. Nevertheless, the researchers hope that such models could eventually prove useful to historians, especially if the benchmark is improved by including data from underrepresented regions and by making the questions more challenging.