🤖 AI Summary
Hallucination in large language models (LLMs) severely undermines their reliability and real-world deployability. Current research is fragmented into two incompatible paradigms: model-centric hallucination detection (HD) and text-centric fact verification (FV), which differ fundamentally in assumptions, data construction, and evaluation protocols. To bridge this gap, we propose UniFact, the first unified evaluation framework enabling instance-level, directly comparable assessment of both HD and FV. Leveraging dynamically generated and annotated paired data, we systematically uncover the complementary strengths of the two paradigms and demonstrate that hybrid HD+FV methods substantially outperform either paradigm alone, achieving new state-of-the-art performance. We open-source all code, datasets, and baseline systems to foster convergence in hallucination research, establishing unified evaluation and collaborative modeling as the new standard.
📝 Abstract
Large Language Models (LLMs) frequently exhibit hallucinations, generating content that appears fluent and coherent but is factually incorrect. Such errors undermine trust and hinder adoption in real-world applications. To address this challenge, two distinct research paradigms have emerged: model-centric Hallucination Detection (HD) and text-centric Fact Verification (FV). Despite sharing the same goal, these paradigms have evolved in isolation, with distinct assumptions, datasets, and evaluation protocols. This separation has created a research schism that hinders their collective progress. In this work, we take a decisive step toward bridging this divide. We introduce UniFact, a unified evaluation framework that enables direct, instance-level comparison between FV and HD by dynamically generating model outputs and corresponding factuality labels. Through large-scale experiments across multiple LLM families and detection methods, we reveal three key findings: (1) no paradigm is universally superior; (2) HD and FV capture complementary facets of factual errors; and (3) hybrid approaches that integrate both methods consistently achieve state-of-the-art performance. Beyond benchmarking, we provide the first in-depth analysis of why FV and HD diverged, as well as empirical evidence supporting the need for their unification. These comprehensive experimental results call for a new, integrated research agenda toward unifying Hallucination Detection and Fact Verification in LLMs. We have open-sourced all code, data, and baseline implementations at: https://github.com/oneal2000/UniFact/
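To make the idea of a hybrid HD+FV approach concrete, the sketch below fuses a model-centric HD score with a text-centric FV score by weighted averaging and thresholds the result. This is a minimal illustrative sketch only: the function names, weights, and threshold are assumptions for exposition, not the UniFact implementation or the paper's actual fusion method.

```python
# Illustrative sketch: fusing a hallucination-detection (HD) confidence
# score with a fact-verification (FV) score. Both scores are assumed to
# lie in [0, 1], where 1.0 means "likely factual". The weighting scheme
# and threshold are hypothetical, not taken from UniFact.

def hybrid_factuality_score(hd_score: float, fv_score: float,
                            hd_weight: float = 0.5) -> float:
    """Weighted average of the HD and FV scores."""
    return hd_weight * hd_score + (1.0 - hd_weight) * fv_score

def is_factual(hd_score: float, fv_score: float,
               threshold: float = 0.5) -> bool:
    """Flag an instance as factual if the fused score clears the threshold."""
    return hybrid_factuality_score(hd_score, fv_score) >= threshold

# Example: HD is confident the output is supported, FV is uncertain;
# the fused score lets the stronger signal carry the decision.
print(is_factual(hd_score=0.9, fv_score=0.4))
```

A real system could learn the fusion weight on held-out labeled instances instead of fixing it at 0.5, which is one way complementary paradigms can be combined.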