🤖 AI Summary
Existing fact-checking evaluations suffer from annotation errors, semantic ambiguity, and suboptimal baseline selection, all of which can distort model rankings: roughly 16% of benchmark examples are ambiguous or mislabeled, enough to shift comparative results. Method: We systematically benchmark 12 large language models (LLMs) and one specialized verifier on examples drawn from 14 fact-checking benchmarks, and propose a systematic LLM-as-a-judge pipeline for identifying annotation errors and ambiguity at scale. We establish few-shot prompted frontier LLMs as strong baselines and augment small fine-tuned verifiers with synthetically generated multi-hop reasoning data. Contribution/Results: Our approach improves small-model accuracy by 12.7% on complex reasoning subsets. We publicly release evaluation code, fine-tuned small models, and a high-quality cleaned dataset. Crucially, we identify and rectify key methodological flaws in current evaluation practices, providing practical guidance toward robust, cost-efficient, and trustworthy fact-checking systems.
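To make the data-cleaning step concrete, the sketch below shows one way an LLM-as-a-judge pass over a fact-checking dataset could look. It is a minimal illustration, not the authors' released pipeline: the prompt wording, verdict categories, and model name are assumptions made for the example.

```python
# Minimal sketch of an LLM-as-a-judge screening pass over (evidence, claim, label) triples.
# Prompt wording, verdict categories, and the model name are illustrative assumptions,
# not the paper's exact pipeline (see the linked repository for the released code).
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are auditing a fact-checking dataset.

Evidence: {evidence}
Claim: {claim}
Gold label: {label}

Is the gold label clearly correct given the evidence?
Answer with exactly one word: OK, AMBIGUOUS, or MISLABELED."""


def screen_example(evidence: str, claim: str, label: str, model: str = "gpt-4o") -> str:
    """Return the judge's verdict for one dataset example."""
    response = client.chat.completions.create(
        model=model,  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            evidence=evidence, claim=claim, label=label)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper()


# Examples flagged AMBIGUOUS or MISLABELED can be dropped or routed to manual review,
# yielding a cleaned subset for comparative evaluation.
```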
📝 Abstract
Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that the approximately 16% of data that is ambiguous or incorrectly labeled substantially influences model rankings. Neglecting this issue may lead to misleading conclusions in comparative evaluations, and we suggest using a systematic LLM-as-a-judge pipeline to help identify these issues at scale. Second, we find that frontier LLMs with few-shot in-context examples, often overlooked in previous work, achieve top-tier performance. We therefore recommend that future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities on such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers.
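For readers who want a sense of what the few-shot baseline looks like in practice, here is a minimal sketch: a frontier LLM is shown a handful of labeled demonstrations and asked to verify a new claim against evidence. The demonstration examples, label set, and model name are placeholders chosen for illustration, not the paper's actual prompt; the released code in the repository above is the authoritative reference.

```python
# Minimal sketch of a few-shot in-context fact-verification baseline.
# Demonstrations, label set, and model name are placeholders, not the paper's prompt.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_DEMOS = """\
Evidence: The Eiffel Tower was completed in 1889.
Claim: The Eiffel Tower was finished in the 19th century.
Verdict: SUPPORTED

Evidence: Mount Everest is 8,849 metres tall.
Claim: Mount Everest is under 8,000 metres tall.
Verdict: REFUTED
"""


def verify(evidence: str, claim: str, model: str = "gpt-4o") -> str:
    """Return a SUPPORTED/REFUTED verdict for a claim given evidence, via few-shot prompting."""
    prompt = FEW_SHOT_DEMOS + f"\nEvidence: {evidence}\nClaim: {claim}\nVerdict:"
    response = client.chat.completions.create(
        model=model,  # placeholder frontier model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```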