🤖 AI Summary
This paper addresses the challenge of inaccurate veracity assessment in the automated verification of numerical factual claims, i.e., those involving quantities, comparisons, and temporal expressions. To tackle this, we propose a verification framework integrating evidence retrieval with natural language inference (NLI). Through systematic evaluation, we analyze the impact of context length and tokenization strategies (e.g., right-to-left, R2L) on numerical reasoning, finding that neither an extended context nor R2L tokenization improves performance, indicating that evidence quality, rather than model architecture, constitutes the primary bottleneck. Methodologically, we design a lightweight evidence retrieval pipeline adapted to QuanTemp and combine it with ModernBERT and an NLI classifier for end-to-end verification. Our system achieves a macro-averaged F1-score of 0.57 on CheckThat! 2025 Task 3, ranking among the top four submissions. The code is publicly released.
📝 Abstract
Numerical claims (statements involving quantities, comparisons, and temporal references) pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of incorporating more evidence via longer input context windows with ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings on arithmetic reasoning tasks, R2L tokenization does not improve natural language inference (NLI) over numerical claims. A longer context window likewise does not improve veracity prediction, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves a competitive macro-averaged F1-score of 0.57, placing us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at https://github.com/dsgt-arc/checkthat-2025-numerical.
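To make the R2L idea concrete: standard (left-to-right) subword tokenizers tend to chunk a digit string from the left, so `1234567` may split as `123|456|7`, misaligning chunks with place value. R2L tokenization instead groups digits from the right, yielding `1|234|567`, which aligns with thousands separators. The sketch below is a minimal, hypothetical illustration of this digit-chunking difference, not the paper's actual tokenizer implementation:

```python
def chunk_digits(number: str, size: int = 3, r2l: bool = True) -> list[str]:
    """Split a digit string into fixed-size chunks.

    With r2l=True, chunking starts from the rightmost digit, so chunk
    boundaries align with place value (like thousands separators).
    """
    if not r2l:
        # Left-to-right: naive greedy chunking from the front.
        return [number[i:i + size] for i in range(0, len(number), size)]
    # Right-to-left: walk backwards, then restore the original order.
    chunks = []
    i = len(number)
    while i > 0:
        chunks.append(number[max(0, i - size):i])
        i -= size
    return chunks[::-1]


print(chunk_digits("1234567", r2l=False))  # ['123', '456', '7']
print(chunk_digits("1234567", r2l=True))   # ['1', '234', '567']
```

With R2L chunking, the same suffix (e.g., `567` as the units-through-hundreds block) receives the same token regardless of the number's total length, which is the alignment property prior arithmetic-reasoning work credits for its gains.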