Today’s AI models promise innovation and efficiency, yet they present significant challenges. _Understanding the extent of hallucinations is key to judging the reliability of their results._ The potential for multiplying errors remains alarming for businesses and users alike. This phenomenon, which experts call hallucination, demands heightened vigilance and in-depth analysis. _Evaluating performance is therefore essential to assess these models’ safety._ Recent studies reveal that some models suffer from notable gaps that compromise the quality of their responses. The stakes intensify as AI spreads into ever more sectors, making critical examination of these tools vital. _An enlightening ranking helps to better anticipate the risks._
Review of AI Models as of July 2025
According to the Phare LLM benchmark, Meta’s Llama 3.1 stands out with the lowest hallucination rate among the AIs tested, which makes it the most reliable model in the ranking. Conversely, the overall performance of several other models reveals concerning results.
Performance Ranking of Models
The ranking comes from the French startup Giskard, which conducted an in-depth analysis of language models. Llama 3.1 ranks first with a reliability score of 85.8%. Gemini 1.5 Pro follows with 79.12%, while Llama 4 Maverick takes third place with 77.63%.
The results also highlight Claude 3.5 Haiku and Claude 3.5 Sonnet, which occupy fourth and sixth place respectively, with very close scores. GPT-4o is well placed in fifth, despite the underperformance of its mini version, which ranks near the bottom of the table.
Poor Performances
At the bottom of the ranking, the French startup Mistral posted weak results, with Mistral Small 3.1 and Mistral Large in 14th and 15th position respectively. More concerning still, Grok 2, developed by xAI, does not exceed an overall score of 61.38%, with an alarming 27.32% for resistance to jailbreaks, that is, attempts to access blocked functions.
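To make these figures easier to compare, here is a minimal Python sketch that gathers the reliability scores explicitly cited in this article and sorts them. Models whose exact scores are not given above are omitted; the snippet is purely illustrative of the reported data.

```python
# Overall reliability scores (%) as cited in this article's summary of the
# Phare LLM benchmark; models without an explicit score above are omitted.
reported_scores = {
    "Llama 3.1": 85.8,
    "Gemini 1.5 Pro": 79.12,
    "Llama 4 Maverick": 77.63,
    "Grok 2": 61.38,
}

# Sort from most to least reliable and print a simple ranking.
for rank, (model, score) in enumerate(
    sorted(reported_scores.items(), key=lambda kv: kv[1], reverse=True),
    start=1,
):
    print(f"{rank}. {model}: {score}%")
```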
Ranking Criteria in the Phare LLM Benchmark
The Phare LLM benchmark evaluates models according to four distinct criteria. First, resistance to hallucinations checks the factual accuracy of the information a model provides. The second criterion, resistance to harm, evaluates whether the AI produces dangerous or harmful content.
Next, resistance to bias tests the AI’s ability to avoid prejudiced answers, including its capacity to handle questions that are phrased in a biased way. Finally, resistance to jailbreaks assesses a model’s ability to withstand attempts to access prohibited features.
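The article does not specify how Phare combines these four criteria into the single reliability score quoted above. As a hedged illustration only, the sketch below assumes a plain unweighted average; the function and weighting are assumptions, not Giskard’s actual methodology.

```python
# Illustrative only: combine four per-criterion scores (each in %) into one
# overall reliability figure. An unweighted average is ASSUMED here; the real
# Phare benchmark may weight or aggregate the criteria differently.
def overall_reliability(hallucination: float, harm: float,
                        bias: float, jailbreak: float) -> float:
    criteria = [hallucination, harm, bias, jailbreak]
    return sum(criteria) / len(criteria)

# Example with made-up numbers, not actual benchmark data:
print(overall_reliability(90.0, 85.0, 80.0, 75.0))  # -> 82.5
```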
Implications for the Future of AIs
The placement of Llama 3.1 and the other podium models underscores the importance of ensuring safe and reliable AI systems. Increased attention must be paid to the lower-performing models, such as Grok 2, to prevent the consequences of their inappropriate use.
This ranking also fuels ongoing debates about how AI systems are developed and evaluated. Users’ expectations of ever-higher performance raise essential ethical questions.
Concerns about AI safety are growing, prompting deeper reflection on the impact of these technologies across many fields. Continuous vigilance is necessary to ensure that technological advances do not compromise the reliability and integrity of AI systems.
FAQs Regarding AI Models with the Most Frequent Hallucinations as of July 2025
What are the most reliable AI models in terms of hallucinations in July 2025?
According to the Phare LLM benchmark, the most reliable AI models in July 2025 include Llama 3.1, Gemini 1.5 Pro, and Llama 4 Maverick, which stand out for their low hallucination rates.
What is a hallucination in the context of AI models?
In the context of AI models, a hallucination refers to a situation where the AI generates incorrect or inaccurate information, often inventing details that do not exist.
How are AI models evaluated in terms of hallucinations?
In the Phare LLM benchmark, AI models are evaluated on four criteria: resistance to hallucinations, resistance to harm, resistance to bias, and resistance to jailbreaks. Together, these criteria estimate a model’s overall reliability.
Why is Llama 3.1 considered the best AI model against hallucinations?
Llama 3.1 ranks first with a reliability score of 85.8%, demonstrating its ability to provide accurate information while avoiding invented elements.
How does Grok 2 score compared to other AI models?
Grok 2 posts the lowest overall reliability score in the ranking, at just 61.38%, which raises serious concerns given its numerous hallucinations.
What impacts can hallucinations of AI models have on users?
Hallucinations can mislead users, offer inappropriate advice, or even spread harmful information, thereby undermining trust in these technologies.
How can users verify the reliability of the answers given by AI models?
Users should always cross-check information provided by AI models against reliable sources and verify that the answers contain no invented or erroneous elements.
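As a toy illustration of such cross-checking (not a production fact-checker), the following sketch flags sentences in a model’s answer that share too little vocabulary with a set of trusted reference texts. The overlap measure and threshold are arbitrary assumptions made for this example.

```python
# Toy cross-check: flag sentences in an AI answer that share too few words
# with any trusted reference text. The tokenization and threshold are crude,
# illustrative assumptions, not a real verification method.
def flag_unsupported(answer: str, references: list[str],
                     min_overlap: float = 0.5) -> list[str]:
    ref_words = {word.lower() for text in references for word in text.split()}
    flagged = []
    for sentence in answer.split("."):
        words = [w.lower() for w in sentence.split()]
        if not words:
            continue
        overlap = sum(w in ref_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence.strip())
    return flagged

# Example: the second sentence invents a detail absent from the reference.
refs = ["Llama 3.1 ranks first in the Phare LLM benchmark with 85.8% reliability."]
print(flag_unsupported("Llama 3.1 ranks first. It was trained on Mars.", refs))
# -> ['It was trained on Mars']
```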
Which models are the worst in terms of hallucinations, according to the ranking?
According to the ranking, the worst models in terms of hallucinations include Grok 2 and GPT-4o mini, which show reliability scores below 70%.