AI hallucinations are among the biggest challenges facing large language models (LLMs). ChatGPT, Gemini, Mistral & Co. often come across as impressively smart – yet time and again they produce answers that are completely wrong. And not tentatively or cautiously, but with full confidence.
So why do AI models hallucinate in the first place? A recent research paper from OpenAI offers new answers. The researchers show that the problem lies not only within the models themselves, but also in the way their answers are evaluated. The result: language models are unintentionally rewarded for hallucinating – much like students taking a test at school. But more on that later.
What Are AI Hallucinations?
AI hallucinations occur when an artificial chatbot like ChatGPT produces answers that sound plausible but are simply wrong. What makes them remarkable: the AI is often completely confident – presenting made-up facts with the same conviction as true information.
In technical language, these are called hallucinations. The term means that artificial intelligence generates statements that are linguistically correct but factually untrue – similar to a person suddenly “hearing voices” and being unable to distinguish them from reality.
A simple example: If ChatGPT is asked about the birthday of a little-known person, it may provide a specific date – even though this information isn’t in the training data. The result: a fabricated but convincing answer.
Key Terms
- Hallucination: A false but plausible-sounding AI statement.
- Abstention/IDK: When a model deliberately refuses to answer, e.g., with “I don’t know” (IDK).
- Accuracy: Share of correct answers.
- Error Rate: Share of false answers = hallucinations.
- Calibration: The AI’s ability to realistically assess its own uncertainty.
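These metrics can be made concrete with a small sketch. Assuming each graded answer carries one of three hypothetical labels – "correct", "wrong", or "idk" – accuracy, error rate, and abstention rate are simple ratios that always sum to 1:

```python
from collections import Counter

def grade_summary(labels):
    """Compute accuracy, error rate, and abstention rate.

    labels: list of grades, each "correct", "wrong", or "idk".
    Because the three rates sum to 1, a low error rate does not
    require a high accuracy - abstentions absorb the difference.
    """
    counts = Counter(labels)
    n = len(labels)
    return {
        "accuracy": counts["correct"] / n,
        "error_rate": counts["wrong"] / n,        # hallucination share
        "abstention_rate": counts["idk"] / n,
    }

# A habitual guesser vs. a model that often abstains (toy data):
guesser = ["correct"] * 24 + ["wrong"] * 75 + ["idk"] * 1
honest  = ["correct"] * 22 + ["wrong"] * 26 + ["idk"] * 52
print(grade_summary(guesser))
print(grade_summary(honest))
```

Note that the two models have nearly the same accuracy, yet very different error rates – exactly the imbalance the article returns to below.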
Why and How Do AI Hallucinations Arise?
Hallucinations are not quirks of individual models, but a statistical effect of today’s training and evaluation setups. Even during pretraining (learning language distribution), unavoidable errors appear – even if the training data were flawless. In post-training, these errors persist because common benchmarks punish uncertainty and reward guessing.
a) Error Causes in Pretraining (Autocompletion / Density Estimation)
1. Why errors are unavoidable
Pretraining makes a model learn the distribution of plausible language – which word statistically fits best next. But not all information follows a clear pattern. That means certain error types are inevitable.
2. Recognition is easier than generation
OpenAI reduces generation errors (hallucinations) to a simpler binary question: “Is this answer valid?” Generating a correct answer is at least as hard as recognizing one, so the classification error rate implies a mathematical lower bound on the generative error rate – hallucinations can never be fully eliminated.
3. Not just next-word prediction
LMs don’t just guess the next word. They perform density estimation – adapting to the statistical structure of language. Errors emerge from the statistical nature of language and knowledge itself.
4. Rare facts are a special case
Rare facts (like obscure birthdays) create epistemic uncertainty – the data simply lacks them. OpenAI introduces the “singleton rate”: the share of queries seen only once in training. At least that fraction of queries will produce hallucinations.
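The singleton rate can be estimated directly from a corpus: count how many distinct facts appear exactly once. A minimal sketch – the `facts` list is a made-up stand-in for training queries:

```python
from collections import Counter

def singleton_rate(facts):
    """Fraction of distinct facts that occur exactly once in training.

    By OpenAI's argument, queries about such one-off facts follow no
    learnable pattern, so this fraction acts as a lower bound on the
    hallucination rate for that class of questions.
    """
    counts = Counter(facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus: two facts repeat, three appear only once.
facts = ["fact_a", "fact_a", "fact_b", "fact_b", "fact_b",
         "fact_c", "fact_d", "fact_e"]
print(singleton_rate(facts))  # 3 of 5 distinct facts -> 0.6
```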
5. Calibration: sensing uncertainty but still failing
Models can come out of pretraining fairly well calibrated – their stated confidence roughly matches how often they are right. But later reinforcement learning can worsen this calibration, making models more confident without making them more correct.
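Calibration can be quantified with the expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket’s average confidence with its actual accuracy. A minimal sketch over purely illustrative (confidence, correct) pairs:

```python
def expected_calibration_error(preds, n_bins=5):
    """ECE: average |confidence - accuracy| gap, weighted by bin size.

    preds: list of (confidence, was_correct) pairs.
    A well-calibrated model has ECE near 0; post-training that
    inflates confidence without improving accuracy raises it.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / len(preds) * abs(avg_conf - acc)
    return ece

# Same stated confidence (0.9), very different reality:
calibrated    = [(0.9, True)] * 4 + [(0.9, False)]       # 80% right
overconfident = [(0.9, True)] + [(0.9, False)] * 4       # 20% right
print(expected_calibration_error(calibrated))     # gap ~0.1
print(expected_calibration_error(overconfident))  # gap ~0.7
```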
6. Different error types
- Poor-model errors: structural tasks models can’t handle (e.g., counting letters).
- Arbitrary-fact errors: questions without statistical rules (e.g., random facts).
b) The School-Test Problem (Evaluation and Incentive Structure)
AI hallucinations don’t arise only during pretraining — they’re reinforced by the way we evaluate models. This is exactly where OpenAI’s analysis comes in: it shows that common benchmarks and leaderboards systematically reward guessing, which in turn encourages hallucinations.
What Are Benchmarks and Leaderboards?
- Benchmarks are standardized tests researchers use to measure how well an AI model handles different tasks – such as quiz questions, language comprehension, or programming challenges.
- The results are often compiled into leaderboards: rankings that list models based on their scores or accuracy.
- These rankings are extremely important: they determine which models are considered “state of the art,” what gets discussed in academic circles, and how companies measure their progress.
In short: benchmarks and leaderboards are the “exam system” of AI research. But just like in school, they can create the wrong incentives.
The Multiple-Choice Analogy
OpenAI’s publication describes it like a classroom test:
- Leaving a question blank = 0 points.
- Guessing = small chance of points, sometimes even correct by chance.
Over many questions, the “guesser” often looks better than the “honest” student who leaves blanks when unsure.
Applied to AI:
- Abstention/IDK → 0 points.
- Guessing → chance of points, even if often wrong.
Conclusion: Models learn that it’s strategically better to hallucinate than to say “I don’t know.”
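The incentive is a one-line expected-value calculation. Under binary scoring, guessing with any nonzero chance of being right beats abstaining – a small sketch with hypothetical numbers:

```python
def expected_test_score(n_questions, p_correct, strategy):
    """Expected points on an n-question test under binary 0/1 scoring.

    "guess": every question is answered; each guess earns one point
             with probability p_correct, zero otherwise.
    "abstain": honest "I don't know" on every uncertain question,
               which always scores zero under this rule.
    """
    return n_questions * p_correct if strategy == "guess" else 0.0

# Blind guessing at 20% per question vs. honest abstention:
print(expected_test_score(50, 0.20, "guess"))    # 10.0 expected points
print(expected_test_score(50, 0.20, "abstain"))  # 0.0 points
```

As long as `p_correct > 0`, the guesser’s expected score is strictly positive while the abstainer’s is zero – which is exactly why models learn to guess.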
Accuracy Dominates Benchmarks
Almost all widely used evaluation metrics measure only accuracy – how often a model is exactly correct.
- Error rate or abstention rate hardly matter on leaderboards.
- A model that hallucinates 10 times but gets 1 random answer right can end up ranking higher than a model that honestly says “I don’t know” 10 times.
The study emphasizes: this imbalance is why error rates can be extremely high even when accuracy looks similar.
Example: In the SimpleQA evaluation, an older model achieved 24% accuracy but had a 75% error rate, while a newer model with more abstentions had nearly the same accuracy but only a 26% error rate.
Why “Abstention” Is Actually More Valuable
The OpenAI paper argues: mistakes are worse than no answers.
- A wrong but confident result can mislead people.
- An honest “I don’t know” prevents harm – even if it doesn’t provide direct information.
OpenAI points to its internal Model Spec: “It is better to show uncertainty or ask for clarification than to state something wrong with full confidence.”
The Statistical Problem with Accuracy-Only
The research shows mathematically:
- If benchmarks only count “right vs. wrong,” the expected value of guessing always dominates.
- This gives models a rational incentive to hallucinate.
- Even with new hallucination-specific benchmarks, nothing changes as long as accuracy-only remains the main metric.
As the paper summarizes: “A good hallucination eval alone is useless if hundreds of classic accuracy evals continue to reward guessing.”
c) Additional Causes of AI Hallucinations
Beyond the fundamental effects of pretraining and benchmark incentives, OpenAI identifies four more categories of causes:
Poor Models – Limits of Model Architecture
Some tasks simply overwhelm language models.
- Example: counting letters or tokens.
- An LM trained on subword units has no native “counting ability.”
- These aren’t data problems but structural limitations of the model itself.
OpenAI calls them poor-model errors – hallucinations rooted in representation or architectural constraints.
OOD Problems (Out-of-Distribution / Distribution Shift)
Models learn from training data with certain patterns and topics.
- When confronted with questions far outside that range, the statistical anchors are missing.
- Example: A model trained mainly on English will almost inevitably hallucinate on complex Māori questions.
- Even very large models can’t extrapolate robustly here – OOD leads to hallucinations regardless of scale.
Complexity & “Hard Problems”
Some tasks are computationally difficult, even for huge models.
- Includes tasks requiring deep logical steps or NP-hard properties.
- Even with perfect data, these can’t always be solved – models can only approximate.
- Example: complex mathematical proofs or highly nested logic problems.
These hard problems remain a structural risk for hallucinations.
GIGO (Garbage In, Garbage Out) – Faulty Training Data
Even though models are trained on massive datasets, those datasets contain errors, contradictions, or outright false information.
- Example: flawed Wikipedia entries or forum posts filled with half-truths.
- Models don’t just reproduce these mistakes – they can even amplify them when they overgeneralize.
- So even if pretraining runs mathematically clean, faulty content remains a persistent source of hallucinations.
Why Are AI Hallucinations So Hard to Solve?
At first glance, it might seem that the bigger and better a model gets, the fewer hallucinations it will produce – eventually eliminating errors. But OpenAI’s latest study disproves this. Hallucinations are not bugs that vanish with more compute or more data; they are the result of several stubborn mechanisms.
100% Accuracy Is Impossible
OpenAI shows: accuracy – the share of correct answers – will never reach 100% in practice.
- Some questions are inherently unanswerable (e.g., “What is the exact birthday of an unknown person?”).
- Others are ambiguous or depend on context the model can’t access.
- Even with perfect data and gigantic models, a residual uncertainty will always remain – and that uncertainty leads to hallucinations.
👉 Bottom line: the idea that hallucinations will disappear if we just raise accuracy is a myth.
Benchmarks Create the Wrong Incentives
As long as leaderboards only measure “right vs. wrong,” models are incentivized to guess rather than abstain.
- Wrong but confident answers often score better than honest uncertainty.
- This cements hallucinations, even if models technically could say “I don’t know.”
👉 Without changing evaluation logic, new models will keep making the same mistakes.
Unavoidable Pattern Gaps
Even if evaluation is fixed, the problem of arbitrary facts remains:
- Rare information without patterns (e.g., dissertation titles of individual researchers) can’t be learned reliably.
- OpenAI points to the singleton rate – facts seen only once during training.
- At least that share of queries will inevitably result in hallucinations.
👉 Even the best-calibrated model will always produce some false answers.
Additional Causes Reinforce the Problem
- Poor Models: Some tasks (like counting letters) are inherently difficult for LMs.
- OOD Effects: Questions outside the training domain reliably trigger hallucinations.
- Complexity: Hard logical problems can’t be solved even by the largest models.
- GIGO: Faulty training data directly feed into faulty outputs.
- These factors ensure hallucinations cannot simply be “scaled away.”
Progress ≠ Solution
GPT-5 shows noticeably fewer hallucinations in benchmarks, especially on complex reasoning tasks. But:
- GPT-5 still hallucinates – just less often.
- And as long as scoreboards, data, and model architecture aren’t fundamentally adjusted, the problem remains.
👉 OpenAI’s conclusion: hallucinations are explainable, measurable – and reducible, but not eliminable.
What to Do About AI Hallucinations? Researchers’ Proposals
The core message of OpenAI’s study: it’s not enough to create new “hallucination tests.” What really needs fixing are the big, established benchmarks. Today, they reward guessing and penalize “I don’t know.” As long as leaderboards are scored that way, models will keep hallucinating – even if specialized anti-hallucination evals exist. The authors propose concrete changes to evaluation, to be widely adopted across major benchmarks.
Penalize Wrong Confident Answers More Than Uncertainty
Today’s standard is a binary 0/1 system: correct = 1 point, abstention/IDK = 0, wrong = 0. Under this logic, abstention is strictly suboptimal – the “test-taker’s” best move is to guess. That directly fuels hallucinations.
Proposal: introduce negative marking or partial credit for uncertainty, so that honest non-answers are better than confident mistakes.
Status quo: Many prominent benchmarks (MMLU-Pro, GPQA, MATH (L5), MuSR, SWE-bench, HLE) give no credit for IDK. Some LM graders even reward flawed but “reasonable” answers over honest IDK, which further incentivizes guessing.
Explicit Confidence Targets in the Instructions
Instead of binary scoring, evaluations should include a confidence rule in the task itself. Example:
“Answer only if you are > t confident. Mistakes cost t/(1−t) points, correct answers give 1 point, ‘I don’t know’ gives 0.”
Meaningful thresholds could be:
- t = 0.5 → penalty 1
- t = 0.75 → penalty 3
- t = 0.9 → penalty 9
This systematically rewards honest uncertainty and makes confident errors visibly expensive.
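The scoring rule above can be written out directly. Its key property: at confidence exactly t, answering and abstaining have the same expected value, so a rational model answers only when its confidence exceeds t. A sketch of the rule as stated, with hypothetical outcome labels:

```python
def confidence_target_score(outcome, t):
    """Score one answer under a confidence-target rule with threshold t.

    correct -> +1, "idk" -> 0, wrong -> -t/(1-t).
    At confidence exactly t the expected value of answering is
    t*1 + (1-t)*(-t/(1-t)) = 0, i.e. equal to abstaining.
    """
    if outcome == "correct":
        return 1.0
    if outcome == "idk":
        return 0.0
    return -t / (1 - t)  # wrong-answer penalty

for t in (0.5, 0.75, 0.9):
    penalty = -confidence_target_score("wrong", t)
    print(f"t={t}: a wrong answer costs {penalty:g} points")
```

Note how the penalty grows sharply with the threshold: at t = 0.9 a single confident mistake wipes out nine correct answers.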
Integrate Into Major Benchmarks – Not Just Niche Evals
The authors warn: if new rules are applied only in small tests, the root problem stays. Confidence targets need to be built into widely used benchmarks (e.g., SWE-bench), where accuracy currently dominates. That’s where the practical effect will matter most.
Measure Behavioral Calibration, Not Just Probabilities
With confidence targets, evaluators can check if a model behaves in line with thresholds t:
- Above t → it answers.
- Below t → it says IDK.
This allows auditing accuracy and error rate across thresholds without relying on post-hoc confidence scores, which are often unreliable.
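Such an audit needs no probability scores at all – only the model’s observable behavior at each threshold. A minimal sketch over hypothetical per-threshold evaluation logs (the outcome labels are made up for illustration):

```python
def audit_behavioral_calibration(results_by_t):
    """Check answer behavior against each confidence target t.

    results_by_t: dict mapping threshold t to a list of outcomes
    ("correct", "wrong", "idk") produced under that instruction.
    A behaviorally calibrated model's accuracy among the questions
    it chose to answer should be at least t.
    """
    report = {}
    for t, outcomes in results_by_t.items():
        answered = [o for o in outcomes if o != "idk"]
        acc = (sum(o == "correct" for o in answered) / len(answered)
               if answered else None)
        report[t] = {
            "answered": len(answered),
            "accuracy_when_answered": acc,
            "calibrated": acc is None or acc >= t,
        }
    return report

# Hypothetical logs: at t=0.9 the model abstains more and, when it
# does answer, is right more often - the desired behavior.
logs = {0.5: ["correct"] * 6 + ["wrong"] * 4,
        0.9: ["correct"] * 3 + ["idk"] * 7}
for t, row in audit_behavioral_calibration(logs).items():
    print(t, row)
```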
Prioritize Quality Over “Completeness”
Today’s 0/1 logic mixes two goals: (a) correctness of content, (b) coverage/completeness of answers. The paper argues: to reduce hallucinations, (a) should weigh more. A short, safe answer (or IDK) is better than a “complete” answer padded with made-up details.
Adoption Is a Socio-Technical Challenge
Even the best scoring scheme is useless if leaderboards don’t adopt it. OpenAI stresses that only when the influential scoreboards change their rules will the training goals shift from “test-taker” optimization to trustworthy assistant behavior.
What Doesn’t Work (But Is Often Tried)
- More hallucination-specific benchmarks alone won’t help if core evals still punish uncertainty.
- RAG (retrieval) and reasoning help, but with binary scoring, guessing still makes sense when evidence is missing.
- LM graders can be fooled by confident bluffs, sometimes rating them higher than IDK. Another reason to fix scoring incentives.
What This Means for Users
Hallucinations aren’t just “bugs” that disappear with the next model generation. They are a structural risk of artificial intelligence – and users need to learn how to deal with them.
For Everyday Users: Verify, Don’t Blindly Trust
- Confident wording is no guarantee of truth. That’s the danger of hallucinations: they sound convincing, even when wrong.
- Anyone using ChatGPT & Co. should make it a habit: “Sounds plausible – but is it really true?”
- Especially for factual questions, a quick cross-check with a reliable source is worth it.
For Businesses: Quality Control Is Mandatory
- AI-generated texts, reports, or presentations save time – but they risk spreading false information.
- Fact-checking and human review should be firmly embedded in every workflow if AI content is published or used in business.
- Some companies already use Retrieval-Augmented Generation (RAG): the model pulls reliable information from a database or document index before answering. This reduces hallucinations significantly – but does not replace review.
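The RAG pattern can be sketched in a few lines. Everything here is a placeholder – the keyword-overlap `retrieve` stands in for a real vector search, and the final string stands in for a real LLM call grounded in the retrieved text. The point is the shape: retrieve first, answer only from evidence, abstain when nothing relevant is found:

```python
import re

STOPWORDS = {"the", "is", "a", "an", "of", "to", "what", "when", "who", "s"}

def keywords(text):
    """Lowercase word set minus trivial stopwords (toy tokenizer)."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def retrieve(query, documents, k=2):
    """Toy retrieval by keyword overlap - a real system would use
    embeddings and a vector index instead."""
    q = keywords(query)
    scored = [(len(q & keywords(d)), d) for d in documents]
    return [d for score, d in sorted(scored, reverse=True)[:k] if score > 0]

def answer_with_rag(query, documents):
    """Answer only from retrieved evidence; abstain when none is found."""
    evidence = retrieve(query, documents)
    if not evidence:
        return "I don't know."  # abstain instead of hallucinating
    # A real system would pass `evidence` to an LLM as grounding context.
    return f"Based on {len(evidence)} source(s): {evidence[0]}"

docs = ["The invoice deadline is 30 days after delivery.",
        "Support is available Monday to Friday."]
print(answer_with_rag("When is the invoice deadline?", docs))
print(answer_with_rag("What is the CEO's birthday?", docs))  # abstains
```

The second query shows the safety property: with no matching evidence, the system says “I don’t know” rather than inventing a date – though, as noted above, this reduces hallucinations without replacing human review.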
For Critical Domains: Extra Caution
- In medicine, law, or finance, a hallucinated answer can have serious consequences.
- Here the rule is: AI may support, but not decide.
- Systems should be designed to stop or ask for clarification under uncertainty, rather than output false facts.
For Developers and Researchers: Make Uncertainty Visible
- A model that can say “I don’t know” is more trustworthy than one that always answers.
- In practice, this means: provide answers with confidence scores, sources, or uncertainty flags.
- Users need not just an answer, but also a sense of its reliability.
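One lightweight way to surface uncertainty is to wrap every answer in a structure with an explicit confidence field and to abstain below a threshold. A sketch – the `confidence` value would come from the model or a calibration layer; here it is simply a parameter, and all names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AssistantReply:
    answer: Optional[str]   # None means the model abstained
    confidence: float       # 0.0 .. 1.0, shown to the user
    flag: str               # human-readable reliability hint

def present_answer(text, confidence, threshold=0.7):
    """Return the answer with an uncertainty flag, or abstain.

    threshold is an application choice: below it, an honest
    "I don't know" is preferred over a possible hallucination.
    """
    if confidence < threshold:
        return AssistantReply(None, confidence, "abstained: confidence too low")
    flag = "high confidence" if confidence >= 0.9 else "verify before relying on this"
    return AssistantReply(text, confidence, flag)

print(present_answer("Paris is the capital of France.", 0.95))
print(present_answer("Her birthday is March 3rd.", 0.4))  # abstains
```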
Outlook: Fewer AI Hallucinations, More Trust?
The good news: AI hallucinations are not an unsolvable riddle. OpenAI’s findings show we now understand the mechanisms behind them well – and there are concrete levers to reduce them.
Progress is visible – GPT-5 as an example
- GPT-5 has already shown far fewer hallucinations than its predecessors in benchmarks.
- Especially in complex reasoning tasks, the difference is clear: fewer guessing games, more “I don’t know.”
- Still, GPT-5 hallucinates – the problem doesn’t vanish with size and compute.
Paradigm shift in evaluation
- OpenAI emphasizes: the key lies not only in model training but in benchmarks and leaderboards.
- Only when confident mistakes are penalized more than honest uncertainty will the incentives for developers change.
- That way, future models can learn to be not just “test-takers” but trustworthy assistants.
More transparency for users
- AI should not hide uncertainty but communicate it.
- Source citations, confidence levels, or the clear “I don’t know” will need to become standard features.
- This will increase not only reliability but also society’s trust in AI systems.
Shared Responsibility
- Research must develop and implement better evaluation methods.
- Businesses must review AI outputs before using them in critical contexts.
- Users must learn to interpret AI answers critically – and not take every plausible formulation at face value.
Conclusion: From Hallucinating to Honest Assistant
Hallucinations are still a fundamental problem of artificial intelligence today. But they are neither mysterious nor unavoidable. With better evaluation methods, honest communication of uncertainty, and consistent quality control, AI can become more reliable step by step.
Perhaps this is the next milestone: an AI that builds trust because it sometimes says, “I don’t know” – showing us that it understands its own limits.
FAQ: AI Hallucinations Explained
What Does It Mean When AI Hallucinates?
When an AI model like ChatGPT outputs false information with full confidence, this is called a hallucination. The answers may sound plausible, but they are factually incorrect.
Why Do AI Models Hallucinate?
Because language models are trained to predict likely words – not the truth. For rare facts, they lack patterns to rely on. In addition, benchmarks reward guessing more than the honest “I don’t know” (IDK).
Does ChatGPT Hallucinate More Than Other AI Models?
Hallucinations affect all major language models, not just ChatGPT. Newer versions like GPT-5 hallucinate significantly less than earlier models – but even the best systems are not entirely free of hallucinations.