25 LLM citation statistics: what the data shows

LLM citation accuracy is one of the most consequential and least understood problems in enterprise AI adoption. These 25 statistics reveal what happens when you actually verify the references.

16 min read · Published June 22, 2026 · Mentionova Research

When a model generates a reference, it looks authoritative. The format is correct. The author names sound plausible. The journal title is familiar. But the citation may not exist, and even when it does, it often does not support the claim it is attached to. Over the past 18 months, researchers have audited LLM citation behavior at scale: hundreds of thousands of generated references, verified against CrossRef, OpenAlex, Semantic Scholar, and JAMA. The results are consistent and uncomfortable.

Key takeaways

  • LLM hallucination rates in citations span 11% to 95% across models and domains; no model is safe without verification.
  • GPT-4o produced 19.9% fabricated citations in a mental health literature review, and 45.4% of its real citations contained bibliographic errors.
  • Even when citations are real, 50 to 90% fail to actually support the claims they are attached to.
  • Models with 70B+ parameters show 1 to 5% hallucination rates versus 25%+ for smaller models, but cost scales accordingly.
  • Conflict-aware prompting improves citation accuracy substantially without changing the underlying model.
  • LLMs systematically favor highly cited papers, reinforcing existing citation hierarchies and under-citing newer or niche work.
  • Enterprises must treat LLM citations as untrusted by default and implement systematic verification layers before any citation is published or acted upon.

Hallucination and false attribution

LLM citation hallucination is not a fringe edge case. It is the baseline condition across most models and domains. The question is not whether a given model hallucinates citations. It is how often, and whether your verification architecture catches it before it causes damage.

1. Hallucination rates range from 14.23% to 94.93% across 13 models

The GhostCite benchmark, published in 2025, evaluated 375,440 LLM-generated citations from 22,800 API interactions across 13 models and 40 domains. DeepSeek showed the lowest hallucination rate at 14.23%. Hunyuan showed the highest at 94.93%.

That is a more than sixfold spread between the best and worst performers (94.93% versus 14.23%). Model choice is not a minor variable in citation-intensive workflows. It is a primary driver of reliability.

2. Cross-model hallucination rates span 11.4% to 56.8% across 69,557 citations

A 2025 cross-model audit of GPT-4o, Claude 3, Gemini, Llama, and others found 11.4% to 56.8% hallucination rates across 69,557 citation instances, verified against CrossRef, OpenAlex, and Semantic Scholar. The variance was "strongly shaped by model, domain, and prompt framing."

Even at the low end, one in nine citations is fabricated. At the high end, more than half are. Enterprises cannot assume citation accuracy based on general model benchmarks. Domain-specific audits are required before deployment.

3. GPT-4 fabricated citations at a 10% rate, down from 25.5% for GPT-3.5

Progress between model generations is real but incomplete. Of 102 references generated by ChatGPT-3.5, 74.5% were real while 25.5% were fabricated. GPT-4 reduced that rate to approximately 10% on general-task queries.

That is a meaningful improvement. But 10% is still unacceptable for compliance, risk, or regulated workflows. Apply a 10% fabrication rate to a 50-page report that averages one citation per page and you have five phantom sources embedded in a document that looks authoritative.
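The arithmetic behind that estimate can be sketched in a few lines. This is a minimal illustration, assuming a document with 50 citations and assuming each citation is fabricated independently at a fixed rate (a simplification that none of the studies above claim). The function names are hypothetical.

```python
def expected_fabricated(n_citations: int, rate: float) -> float:
    """Expected number of fabricated citations at a fixed per-citation rate."""
    return n_citations * rate


def prob_at_least_one(n_citations: int, rate: float) -> float:
    """Chance that at least one citation is fabricated.

    Assumes citations fail independently, which is an illustrative
    simplification rather than a finding from the studies.
    """
    return 1.0 - (1.0 - rate) ** n_citations


if __name__ == "__main__":
    print(expected_fabricated(50, 0.10))   # expected phantom sources
    print(prob_at_least_one(50, 0.10))     # chance the report has any
```

Under those assumptions the second number is the more useful one for risk teams: it expresses how likely a document is to contain any fabricated source at all, not just how many to expect on average.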

4. GPT-4 produced fully accurate references only 43.3% of the time in a medical task

In a medical knowledge task, GPT-4 produced references for all answers when prompted, but only 43.3% of them were fully accurate. The remaining 56.7% were incorrect or nonexistent.

The domain matters. General-task performance is better than specialized-domain performance. For health, finance, or legal use cases, LLM-generated references cannot be trusted without verification. More than half may be misleading or fabricated even from a top-tier model.

5. GPT-4o produced 19.9% fabricated citations in a mental health literature review

A 2025 JMIR Mental Health study reported that in simulated mental-health literature reviews, GPT-4o produced 19.9% fabrications. Among the citations that matched real publications, 45.4% contained bibliographic errors, most often incorrect or invalid DOIs.

Combining the two figures (19.9% fabricated, plus bibliographic errors in 45.4% of the remaining 80.1%), about 56% of all citations generated by GPT-4o were either fabricated or contained errors. In a sensitive, specialized domain, even advanced models show systemic citation unreliability.

6. DOI error rate reached approximately 89% for humanities citations from GPT-3.5

Even where the work itself was real, the metadata quality was extremely poor. GPT-3.5's humanities references had incorrect DOIs in approximately 89% of cases.

For enterprises that depend on linkable, machine-usable citations, LLM outputs often require post-processing and metadata cleanup to be operationally useful. A citation that exists but cannot be retrieved is not much better than one that does not exist at all.
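One way to approach that cleanup is a two-step DOI check: a cheap syntax test, then a lookup against a registry. The sketch below is an assumption about how such a check might look, not a method from the studies cited here. It uses a commonly used pattern for modern DOIs (older DOIs may not match it) and the public Crossref REST endpoint `https://api.crossref.org/works/{doi}`. The function names are hypothetical.

```python
import json
import re
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

# Commonly used pattern for modern DOIs; some legacy DOIs will not match.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)


def looks_like_doi(doi: str) -> bool:
    """Syntactic check only; says nothing about whether the DOI resolves."""
    return bool(DOI_PATTERN.match(doi.strip()))


def crossref_title(doi: str, timeout: float = 10.0) -> Optional[str]:
    """Return the title Crossref holds for a DOI, or None if it has no record."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi.strip())
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            record = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # DOI is not registered with Crossref
            return None
        raise
    titles = record.get("message", {}).get("title") or []
    return titles[0] if titles else None
```

A DOI that passes the syntax check but returns no record, or returns a title that does not match the generated reference, is a candidate for manual review. DOIs registered with agencies other than Crossref would need a different lookup.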

7. Models with 70B+ parameters show 1 to 5% hallucination versus 25%+ for smaller models

A 2024 industry analysis found that models with over 70 billion parameters have a 1 to 5% hallucination rate, whereas smaller models can exceed 25%. The correlation is clear: larger, more expensive models produce fewer hallucinated citations.

This creates a cost-accuracy tradeoff. Premium large models meaningfully reduce citation hallucinations but still require verification layers. The question for procurement teams is whether the cost delta justifies the error reduction for their specific use case.

Citation accuracy rates

Citation accuracy is distinct from hallucination. A citation can be real and still be wrong: wrong DOI, wrong author order, wrong page numbers, or wrong claim alignment. The accuracy problem is layered.

8. LLM-assisted citation screening achieved 75% sensitivity and 99% specificity in a medical study

In a 2024 diagnostic study of LLM-assisted citation screening for five clinical questions, the LLM achieved a sensitivity of 0.75 and a specificity of 0.99 when compared against conventional screening.

High specificity means the model reliably filtered out irrelevant citations. But 75% sensitivity means it missed one in four relevant ones. For enterprises using LLMs to triage documents, human oversight remains necessary in critical domains.
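For readers who want the two metrics spelled out, here is a minimal sketch. The counts in the example are illustrative values chosen to reproduce 0.75 and 0.99; they are not the study's actual confusion matrix.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly relevant citations that the screener kept."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of truly irrelevant citations that the screener excluded."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Illustrative counts only: 100 relevant and 1,000 irrelevant citations.
    print(sensitivity(true_pos=75, false_neg=25))
    print(specificity(true_neg=990, false_pos=10))
```

The asymmetry is the point: with these counts the screener wrongly keeps only 10 of 1,000 irrelevant items, but it silently drops 25 of the 100 that mattered.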

9. Sensitivity ranged from 0.25 to 1.00 across individual clinical questions in the same study

Aggregate accuracy figures hide enormous variance. Across individual clinical questions in the same JAMA Network Open study, sensitivity ranged from 0.25 to 1.00, while specificity ranged from 0.98 to 0.99.

The same LLM pipeline can be near-perfect on some queries and poor on others. Enterprises should not rely on aggregate accuracy claims alone. They need domain- and task-specific evaluation of LLM citation behavior before deployment.

10. GPT-4o achieved 98.9% consistency with human reviewers in citation screening

A 2025 study comparing GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B on citation screening reported consistency with human reviewers of 98.9% for GPT-4o, 97.8% for Gemini 1.5 Pro, 95.9% for Claude 3.5 Sonnet, and 98.0% for Llama 3.3 70B, each reported with a 95% confidence interval.

This is one area where frontier models genuinely excel: filtering out irrelevant citations at near-human consistency. For organizations building review or due-diligence workflows, these rates suggest LLMs can substantially reduce manual screening load.

11. 50% of generated search results lacked citations; only 75% of cited results supported claims

A 2024 review on citation accuracy challenges noted that 50% of generated search results lacked citations, and only 75% of those with citations actually supported the claims they were attached to.

Citation omission is as big a problem as incorrect citation, especially in answer-engine settings. Enterprises adopting LLM search interfaces should track "answers without evidence" and "evidence alignment rate" as core KPIs, not just citation count.
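The two KPIs named above can be computed from a simple audit log. This is a sketch under stated assumptions: the `Answer` record and both function names are hypothetical, and the support judgment is assumed to come from a human reviewer or a separate checking step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Answer:
    has_citation: bool
    # None when the answer carries no citation to judge.
    citation_supports_claim: Optional[bool] = None


def answers_without_evidence_rate(answers: List[Answer]) -> float:
    """Share of answers that carry no citation at all."""
    return sum(1 for a in answers if not a.has_citation) / len(answers)


def evidence_alignment_rate(answers: List[Answer]) -> float:
    """Among cited answers, the share whose citation supports the claim."""
    cited = [a for a in answers if a.has_citation]
    if not cited:
        return 0.0
    return sum(1 for a in cited if a.citation_supports_claim) / len(cited)
```

Multiplying the two tells you how many answers are actually backed. With the review's figures (50% uncited, 75% of cited answers aligned), only 0.5 × 0.75 = 37.5% of answers would carry a citation that supports them.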

12. 50 to 90% of LLM citations do not fully support the claims they are attached to

The claim-evidence alignment problem is distinct from hallucination and often worse in practice. Based on a synthesis of multiple studies, including Princeton benchmark research, 50 to 90% of LLM citations do not fully support the claims they are attached to.

A formatted reference that does not support the claim is worse than no citation. It creates false confidence. Enterprises should evaluate claim-evidence alignment, not just whether a citation exists.

Performance by model type

Model selection is not a commodity decision for citation-intensive workflows. The spread between best and worst performers is wide enough to determine whether a workflow is viable or not.

13. Gemini 1.5 Pro and Claude 3.5 Sonnet achieved sensitivity 0.94 but specificity 0.80 to 0.85

Different models optimize for different goals. Gemini 1.5 Pro and Claude 3.5 Sonnet achieved sensitivity of 0.94 but lower specificity of 0.80 to 0.85, meaning they missed fewer relevant citations but produced more false positives.

GPT-4o and Llama 3.3 70B showed sensitivity 0.85 to 0.88 and specificity 0.93 to 0.97. Enterprises must choose models and thresholds according to whether they prioritize not missing anything (high sensitivity) or minimizing noise (high specificity) in their citation workflows.
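The threshold half of that choice can be illustrated with a toy example. The sketch assumes the screening model emits a relevance score per citation and that items scoring at or above a threshold are kept. The scores, labels, and function name are hypothetical, and the function needs at least one relevant and one irrelevant item.

```python
from typing import List, Tuple


def rates_at(scores: List[float], labels: List[bool],
             threshold: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when scores >= threshold are kept."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, relevant in pairs if relevant and s >= threshold)
    fn = sum(1 for s, relevant in pairs if relevant and s < threshold)
    tn = sum(1 for s, relevant in pairs if not relevant and s < threshold)
    fp = sum(1 for s, relevant in pairs if not relevant and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Made-up relevance scores and ground-truth labels for six citations.
    scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
    labels = [True, True, False, True, False, False]
    for threshold in (0.7, 0.35):
        print(threshold, rates_at(scores, labels, threshold))
```

On this toy data, lowering the threshold from 0.7 to 0.35 recovers the relevant citation that was being missed, at the price of letting one irrelevant citation through. That is the same trade the models above make in different directions.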

14. Hallucination rates span a fivefold range shaped by model, domain, and prompt framing

The cross-model audit explicitly notes that observed hallucination rates span a fivefold range and are "strongly shaped by model, domain, and prompt framing." Vendor selection, domain fine-tuning, and prompt design materially change citation reliability.

Procurement and architecture teams must run benchmarking across their own content and use cases rather than extrapolating from public averages. A model that performs well on general benchmarks may perform poorly on your specific domain.
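An in-house benchmark can start as little more than a tally of audited citations per model and domain. The sketch below assumes each audited citation has already been labeled as fabricated or not by a reviewer or a database lookup. The record layout, model names, and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per audited citation: (model, domain, is_fabricated).
AuditRecord = Tuple[str, str, bool]


def hallucination_rates(
    records: Iterable[AuditRecord],
) -> Dict[Tuple[str, str], float]:
    """Fabrication rate for every (model, domain) pair seen in the audit."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    fabricated: Dict[Tuple[str, str], int] = defaultdict(int)
    for model, domain, is_fabricated in records:
        totals[(model, domain)] += 1
        fabricated[(model, domain)] += int(is_fabricated)
    return {key: fabricated[key] / totals[key] for key in totals}


if __name__ == "__main__":
    audit = (
        [("model-a", "general", False)] * 9 + [("model-a", "general", True)]
        + [("model-a", "legal", True)] * 2 + [("model-a", "legal", False)] * 2
    )
    for key, rate in sorted(hallucination_rates(audit).items()):
        print(key, rate)
```

Breaking the rate out by domain is what keeps a good general-task average from hiding a poor result on the content that matters to you.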

15. LLM memory scores correlated weakly with citation impact: Spearman rho = 0.1495

The LLM-Metrics study probed 17 LLMs on 549 research papers and found that LLM memory scores had a Spearman rho of 0.1495 and a Pearson correlation of 0.1446 (against log citations) with later citation counts.

LLM internal "memory" of papers is weakly but significantly correlated with human citation impact. Enterprise LLMs could be used as early, imperfect signals of content influence, but the correlations are modest and unsuitable as standalone impact metrics.
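For context on what a rho of that size means, Spearman's rho is simply the Pearson correlation computed on ranks. The sketch below is a plain-Python illustration of that definition with tie handling by average rank. It is not the study's code, and the example data is made up.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank correlation: Pearson applied to the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    memory_scores = [1, 2, 3, 4, 5]    # made-up values
    citation_counts = [2, 1, 4, 3, 5]  # made-up values
    print(spearman_rho(memory_scores, citation_counts))
```

A rho of 1.0 means the two rankings agree perfectly and 0 means no monotonic relationship, so a value near 0.15 indicates that the ordering by memory score only loosely tracks the ordering by citations.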

Citation verification methods

Verification is not a single step. It is an architecture. The research on verification methods shows that prompt design, model selection, and retrieval augmentation each contribute independently to citation accuracy.

16. Citation accuracy was 0.0% for all models on ambiguous QA benchmarks under standard prompting

Out-of-the-box LLMs failed completely at accurately citing sources in carefully constructed tasks. On AmbigQA-Cite, DisentQA-ParaCite, and DisentQA-DupliCite, citation accuracy was 0.0% for all evaluated models, including GPT-4o-mini and Claude-3.5, under standard prompting.

This is not a model failure. It is a system design failure. Prompt and system design, not just model choice, are essential to achieving usable citation accuracy.

17. Conflict-aware prompting leads to large improvements in citation accuracy

The same factuality study found that conflict-aware prompting improved citation accuracy substantially while maintaining answer accuracy on the ambiguous QA benchmarks.

Explicitly asking the model to handle conflicting evidence can transform citation fidelity. For enterprises, investing in prompt templates and reasoning strategies tailored to citation verification can be as important as upgrading the underlying model, and considerably cheaper.
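A prompt template along these lines might look like the sketch below. The wording is an assumption written for illustration; it is not the prompt used in the study, which is not reproduced here. The function name is hypothetical.

```python
from typing import List


def conflict_aware_prompt(question: str, passages: List[str]) -> str:
    """Build a prompt that tells the model how to handle disagreeing sources.

    The instruction text is illustrative wording, not taken from any study.
    """
    numbered = "\n".join(
        f"[{i}] {passage}" for i, passage in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered passages below.\n"
        "The passages may disagree with each other. If they do, say so, "
        "give each distinct answer separately, and cite the passage number "
        "that supports each one.\n"
        "Do not cite a passage that does not state the answer it is "
        "attached to.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    print(conflict_aware_prompt(
        "When was the policy introduced?",
        ["The policy was introduced in 2019.", "The policy dates from 2021."],
    ))
```

Numbering the passages gives the model something concrete to cite and gives a downstream checker something concrete to verify against.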

18. Rigorous verification is mandatory: the conclusion of a peer-reviewed medical journal

The JMIR Mental Health study concludes that due to high fabrication and error rates, "rigorous verification" is mandatory for researchers and students using LLM-generated references. The recommendation includes systematic human verification and potential detection software.

This is not a best practice recommendation from a vendor. It is the conclusion of a peer-reviewed medical journal. For enterprises in regulated industries, this framing has compliance implications.

Industry adoption metrics

Direct, quantified industry-wide adoption statistics for LLM citation workflows remain sparse in the published literature. The following statistics address how LLMs behave as citers in scholarly and enterprise contexts, which is the relevant baseline for adoption decisions.

19. GPT-4o generated 274,951 references for 10,000 focal papers in a 2025 study

The scale of LLM citation generation is already substantial. GPT-4o generated 274,951 references for 10,000 focal papers in a 2025 arXiv study examining how deeply LLMs internalize scientific citations.

This study provides the largest single-model audit of citation generation behavior to date. The volume underscores that LLM citation generation is not a niche use case. It is happening at scale, with or without enterprise governance frameworks in place.

20. LLMs reinforce the Matthew effect by favoring highly cited papers

The same 2025 study found that LLM-generated references show a strong preference for highly cited papers, reinforcing the Matthew effect in citations. Already well-cited literature gets more citations from LLMs. Newer or niche work gets systematically under-cited.

For content strategy and knowledge management, this means incumbent, authoritative sources gain more AI visibility while emerging research is overlooked. This is not a neutral outcome for organizations trying to surface novel or proprietary research.

Enterprise implementation challenges

The implementation challenges for enterprise LLM citation workflows are not primarily technical. They are architectural: how do you build verification into the workflow without making it so slow or expensive that it defeats the purpose of using LLMs in the first place?

21. More than half of GPT-4o citations in a mental health study were fabricated or erroneous

The JMIR Mental Health finding deserves its own entry because of its implications for enterprise risk. By the study's figures (19.9% fabricated, plus bibliographic errors in 45.4% of the real remainder), about 56% of citations generated by GPT-4o were either fabricated or contained errors in a specialized domain.

Apparent precision, meaning a correctly formatted reference, can mask a high rate of underlying inaccuracy. Organizations must treat LLM citation outputs as untrusted until validated, especially in regulated or reputationally sensitive fields.

22. GhostCite evaluated 375,440 citations across 40 domains, finding no model with zero hallucination

The GhostCite benchmark is the most comprehensive cross-domain citation audit published to date, covering 375,440 citations from 13 models across 40 domains. No model achieved zero hallucination. The best performer (DeepSeek) still hallucinated 14.23% of citations.

This is the baseline reality for enterprise implementation: there is no model you can deploy without a verification layer. The question is how to design that layer efficiently.

23. Domain specificity drives error: niche areas show systemic citation unreliability even in advanced models

The pattern across multiple studies is consistent. GPT-4-class models perform better on general tasks than on specialized ones. The mental health study, the medical knowledge task, and the humanities DOI error data all point in the same direction: domain specificity drives errors upward, even in frontier models.

For enterprises deploying LLMs in specialized domains (legal, medical, financial, technical), the general-task hallucination benchmarks are not the relevant baseline. Domain-specific audits are required.

Benchmark comparisons

The benchmark landscape for LLM citation accuracy is maturing. Several large-scale audits now provide cross-model, cross-domain comparisons that were not available two years ago.

24. GhostCite vs. cross-model audit: best-case hallucination rates of 14.23% and 11.4% respectively

Two major 2025 benchmarks provide different but compatible pictures of best-case citation performance. GhostCite found a minimum 14.23% rate (DeepSeek). The cross-model audit found a minimum of 11.4% across 69,557 citations.

The convergence around 11 to 14% as a best-case floor is significant. Even the most capable models, in their best-performing domains, hallucinate roughly one in seven to one in nine citations. This is the floor, not the average.

25. Consistency scores of 95.9% to 98.9% for frontier models in citation screening tasks

The highest accuracy figures in the literature come from citation screening tasks, not citation generation tasks. Consistency of 95.9% to 98.9% for GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.3 70B in systematic review screening represents the current ceiling of LLM citation performance.

The distinction matters: screening existing citations for relevance is a different and easier task than generating new citations from scratch. Enterprises should match the task to the capability, not assume that high screening accuracy implies high generation accuracy.

Building citation-reliable AI workflows

The statistics above converge on a clear set of architectural principles for enterprises building LLM workflows that depend on citation accuracy.

Match the task to the capability. LLMs are significantly better at screening existing citations for relevance (95 to 99% consistency) than at generating new citations from scratch (11 to 95% hallucination). Design workflows that use LLMs for what they do well and route generation tasks through verification layers.

Run domain-specific audits before deployment. Public benchmarks measure general-task performance. Your domain is not general. The mental health study, the medical knowledge task, and the humanities DOI data all show that specialized domains produce higher error rates than general benchmarks suggest. Test your models on your content before you trust them with your citations.

Invest in prompt engineering before upgrading models. Conflict-aware prompting transformed citation accuracy from 0% to usable levels on ambiguous QA benchmarks without changing the underlying model. Prompt templates and reasoning strategies are often cheaper and faster to implement than model upgrades, and the evidence suggests they are comparably effective.

Treat claim-evidence alignment as a separate verification step. A citation that exists but does not support the claim is worse than no citation. It creates false confidence. Build verification workflows that check not just whether a source is real, but whether it actually says what the model claims it says.
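The two-step check described above can be sketched as a small pipeline with pluggable checkers. Everything here is an assumption for illustration: the `Citation` and `Verdict` types, the `verify` function, and the idea that existence is checked by a database lookup and alignment by a reviewer or a separate model are not taken from any cited study.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Citation:
    claim: str       # the sentence the citation is attached to
    reference: str   # identifier of the cited source, e.g. a DOI


@dataclass
class Verdict:
    citation: Citation
    exists: bool
    supports_claim: bool

    @property
    def trusted(self) -> bool:
        """A citation is trusted only if it is real and supports the claim."""
        return self.exists and self.supports_claim


def verify(
    citations: List[Citation],
    source_exists: Callable[[str], bool],
    supports: Callable[[str, str], bool],
) -> List[Verdict]:
    """Run the existence check first, then alignment for sources that exist."""
    verdicts = []
    for citation in citations:
        exists = source_exists(citation.reference)
        # Skip the (usually costlier) alignment check for missing sources.
        aligned = exists and supports(citation.claim, citation.reference)
        verdicts.append(Verdict(citation, exists, aligned))
    return verdicts
```

Keeping the two results separate in the verdict makes it possible to report fabrication and misalignment as distinct failure rates instead of a single pass or fail.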

Size matters, but not enough to skip verification. Models with 70B+ parameters show 1 to 5% hallucination rates versus 25%+ for smaller models. That is a meaningful difference. But even at 1%, a large-scale citation workflow will produce fabricated references. Larger models reduce the verification burden; they do not eliminate it.

The citation accuracy problem is also a visibility problem. Most enterprises do not know what AI engines are citing when buyers ask questions about their category. They do not know whether their own content is being cited, whether competitors are being cited instead, or whether the citations are accurate. AI visibility tracking runs real buyer questions across ChatGPT, Perplexity, Claude, Gemini, Google AI Overviews, and Reddit, logging every citation, every source URL, and every position shift. The inverse of the hallucination problem is the invisibility problem: not "are the citations real?" but "am I being cited at all?"

FAQ

Questions, answered.

How accurate are LLM citations in general?
It depends on the model, domain, and prompt design. Hallucination rates range from 11% to 95% across models and domains based on 2025 benchmark data. Even top-tier models fabricate citations at rates of roughly 10 to 20% (about 10% for GPT-4 on general tasks, 19.9% for GPT-4o in a mental health literature review), and error rates rise further in specialized domains once bibliographic errors are counted. Verification is mandatory, not optional.
Can I trust GPT-4o citations?
GPT-4-class models are substantially better than GPT-3.5 (approximately 10% fabrication for GPT-4 versus 25.5% for GPT-3.5), but 10% is still unacceptable for compliance or regulated workflows. In the mental health literature review study, about 56% of GPT-4o citations were either fabricated or contained errors. Always verify against authoritative databases before publishing or acting on LLM-generated references.
Does prompt engineering actually improve citation accuracy?
Yes, substantially. Conflict-aware prompting improved citation accuracy from 0% to usable levels on ambiguous QA benchmarks without changing the underlying model. Explicit reasoning instructions and prompt templates tailored to citation verification are often cheaper and faster to implement than model upgrades, and the evidence suggests they are comparably effective.
Why do LLMs cite highly cited papers more often?
LLMs reinforce the Matthew effect by favoring already well-cited literature. A 2025 study generating 274,951 references with GPT-4o found a strong preference for highly cited papers, amplifying existing citation hierarchies. This means newer or niche work is systematically under-cited by LLMs, which has implications for knowledge management and content strategy.
What is the difference between citation hallucination and claim-evidence misalignment?
Citation hallucination means the cited source does not exist. Claim-evidence misalignment means the source exists but does not actually support the claim it is attached to. Both are problems. Research suggests 50 to 90% of LLM citations do not fully support the claims they are attached to, even when the citations are real.
How should enterprises structure LLM citation verification?
Three layers work together: domain-specific auditing before deployment to establish baseline hallucination rates, prompt engineering with conflict-aware prompting and explicit reasoning instructions, and systematic cross-referencing against authoritative databases like CrossRef, OpenAlex, or Semantic Scholar. Human review remains necessary in regulated or reputationally sensitive domains. Run your AI visibility diagnostic to see what AI engines are citing about your brand.