🤖 AI Summary
This work addresses the limitations of existing fact-checking benchmarks in handling complex claims that require cross-table reasoning over large-scale structured data. To bridge this gap, we introduce ClaimDB, the first fact-verification benchmark grounded in 80 real-world, multi-domain relational databases. Claims and their supporting evidence are generated via executable programs, reframing verification from textual comprehension to programmatic reasoning. Notably, ClaimDB enables the first systematic evaluation of models' ability to abstain from answering when the evidence is insufficient to decide. Evaluations of 30 prominent large language models show that none exceeds 83% accuracy, more than half score under 55%, and nearly all exhibit unreliable abstention behavior, highlighting significant shortcomings in high-stakes data analysis scenarios.
📝 Abstract
Despite substantial progress in fact-verification benchmarks, claims grounded in large-scale structured data remain underexplored. In this work, we introduce ClaimDB, the first fact-verification benchmark where the evidence for claims is derived from compositions of millions of records and multiple tables. ClaimDB consists of 80 unique real-life databases covering a wide range of domains, from governance and healthcare to media, education, and the natural sciences. At this scale, verification approaches that rely on "reading" the evidence break down, forcing a timely shift toward reasoning in executable programs. We conduct extensive experiments with 30 state-of-the-art proprietary and open-source (below 70B) LLMs and find that none exceeds 83% accuracy, with more than half below 55%. Our analysis also reveals that both closed- and open-source models struggle with abstention -- the ability to admit that there is no evidence to decide -- raising doubts about their reliability in high-stakes data analysis. We release the benchmark, code, and the LLM leaderboard at https://claimdb.github.io.
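To illustrate the shift from "reading" evidence to programmatic reasoning, here is a minimal sketch of what database-grounded verification with abstention can look like. All names (the toy `hospitals` table, the example claim, the `verify` helper) are hypothetical illustrations, not part of the ClaimDB release:

```python
import sqlite3

# Toy relational database standing in for one of ClaimDB's real databases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hospitals (name TEXT, region TEXT, beds INTEGER)")
conn.executemany(
    "INSERT INTO hospitals VALUES (?, ?, ?)",
    [("A", "North", 250), ("B", "North", 180), ("C", "North", 300),
     ("D", "South", 220), ("E", "North", 210), ("F", "North", 205)],
)

def verify(sql: str, predicate) -> str:
    """Execute an evidence-gathering query and check the claim's predicate.

    If the query cannot be executed (e.g. the claim references a table or
    column the database does not contain), abstain instead of guessing.
    """
    try:
        value = conn.execute(sql).fetchone()[0]
    except sqlite3.OperationalError:
        return "NOT ENOUGH INFO"  # abstention: no evidence to decide
    return "SUPPORTED" if predicate(value) else "REFUTED"

# Claim: "More than 3 hospitals in region 'North' have over 200 beds."
label = verify(
    "SELECT COUNT(*) FROM hospitals WHERE region = 'North' AND beds > 200",
    lambda n: n > 3,
)
print(label)  # SUPPORTED (hospitals A, C, E, F qualify)
```

The point of the sketch is that the verdict comes from executing a program against the data, and that a well-calibrated verifier must sometimes return "NOT ENOUGH INFO" rather than a confident label, which is exactly the abstention behavior the benchmark measures.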