A Benchmark for Open-Domain Numerical Fact-Checking Enhanced by Claim Decomposition

📅 2025-10-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing numerical fact-checking methods fail to adequately handle numerical claims expressed in natural language, and mainstream benchmarks rely on heuristic claim decomposition and weakly supervised web search, leading to low evidence relevance, noisy sources, and temporal leakage, thereby compromising evaluation validity. To address these issues, we propose QuanTemp++, the first high-quality benchmark for open-domain numerical fact-checking. It models human verification behavior by designing a principled claim decomposition strategy, integrating weakly supervised retrieval with manual curation to ensure temporally sound, highly relevant evidence. We systematically evaluate the impact of diverse decomposition paradigms on retrieval and verification performance, and release an open-source dataset comprising naturally occurring numerical claims, precisely aligned evidence, and verified truth labels. Experimental results demonstrate that well-designed claim decomposition significantly improves fact-checking accuracy, establishing a more realistic, rigorous, and reproducible evaluation standard for numerical fact-checking.

📝 Abstract
Fact-checking numerical claims is critical, as the presence of numbers lends a mirage of veracity to claims that are in fact false, potentially causing catastrophic impacts on society. Prior work in automatic fact verification does not primarily focus on natural numerical claims. A typical human fact-checker first retrieves relevant evidence addressing the different numerical aspects of the claim and then reasons about it to predict the claim's veracity. The search process of a human fact-checker is thus a crucial skill that forms the foundation of the verification process, and emulating this real-world setting is essential for developing automated methods that encompass such skills. However, existing benchmarks employ heuristic claim decomposition approaches augmented with weakly supervised web search to collect evidence for verifying claims. This sometimes results in less relevant evidence and noisy sources with temporal leakage, rendering a less realistic retrieval setting for claim verification. Hence, we introduce QuanTemp++: a dataset consisting of natural numerical claims and an open-domain corpus, with the corresponding relevant evidence for each claim. The evidence is collected through a claim decomposition process that approximately emulates the approach of a human fact-checker, with veracity labels ensuring there is no temporal leakage. Given this dataset, we also characterize the retrieval performance of key claim decomposition paradigms. Finally, we observe their effect on the outcome of the verification pipeline and draw insights. The code for the data pipeline, along with a link to the data, can be found at https://github.com/VenkteshV/QuanTemp_Plus
Problem

Research questions and friction points this paper is trying to address.

Developing a benchmark for open-domain numerical fact verification
Addressing the limitations of heuristic claim decomposition methods
Ensuring realistic evidence retrieval without temporal leakage
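The temporal-leakage constraint above amounts to excluding any evidence published after the claim was made, so that a verifier cannot see later fact-checks of the very claim it is judging. A minimal sketch of such a filter (the record fields and dates are illustrative, not the QuanTemp++ schema):

```python
from datetime import date

def filter_temporal(claim_date, evidence):
    """Keep only evidence published on or before the claim's date,
    preventing post-hoc fact-checks from leaking into retrieval."""
    return [e for e in evidence if e["published"] <= claim_date]

# Hypothetical evidence pool: one document predates the claim, one follows it.
evidence = [
    {"text": "pre-claim report", "published": date(2023, 1, 10)},
    {"text": "post-claim fact-check", "published": date(2023, 6, 1)},
]

print([e["text"] for e in filter_temporal(date(2023, 3, 1), evidence)])
# ['pre-claim report']
```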
Innovation

Methods, ideas, or system contributions that make the work stand out.

Claim decomposition mimics human fact-checker approach
Dataset prevents temporal leakage in evidence collection
Benchmark evaluates retrieval and verification pipeline performance
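The pipeline the paper evaluates follows a decompose-retrieve-verify pattern: split a compound numerical claim into sub-claims, retrieve evidence for each, and aggregate per-sub-claim verdicts. The toy sketch below illustrates only that control flow; the connective-based decomposition, word-overlap retriever, and number-matching check are deliberately naive stand-ins, not the paper's method:

```python
def decompose(claim):
    """Split a compound claim into sub-claims on a coordinating connective."""
    parts = [p.strip() for p in claim.replace(" and ", "|").split("|")]
    return [p for p in parts if p]

def retrieve(sub_claim, corpus, k=1):
    """Rank corpus snippets by word overlap with the sub-claim (toy retriever)."""
    query = set(sub_claim.lower().split())
    scored = sorted(corpus,
                    key=lambda doc: len(query & set(doc.lower().split())),
                    reverse=True)
    return scored[:k]

def verify(claim, corpus):
    """Verify each sub-claim against its top evidence; aggregate with AND."""
    verdicts = []
    for sub in decompose(claim):
        top_evidence = retrieve(sub, corpus)[0]
        # Placeholder entailment check: every number in the sub-claim
        # must also appear in the retrieved evidence snippet.
        numeric_tokens = [t for t in sub.split() if t.strip("%.,").isdigit()]
        verdicts.append(all(n in top_evidence for n in numeric_tokens))
    return all(verdicts)

corpus = [
    "GDP grew 3% in 2023 according to official statistics.",
    "Unemployment fell to 5% in 2023.",
]
claim = "GDP grew 3% in 2023 and unemployment fell to 5%"
print(verify(claim, corpus))  # True: both numeric sub-claims are supported
```

In a real system each stand-in would be replaced: decomposition by an LLM or learned splitter, retrieval by a dense or sparse ranker over the open-domain corpus, and the number check by an entailment model.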