Think Right, Not More: Test-Time Scaling for Numerical Claim Verification

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from reasoning drift when fact-checking complex numerical claims: they struggle to preserve numerical semantics and logical consistency across reasoning steps. To address this, the paper proposes VerifierFC, a framework that (1) generates multiple reasoning paths via test-time computation; (2) trains a dedicated verifier model to score and filter these paths; and (3) adds an adaptive resource-allocation mechanism that sets the computational budget according to claim complexity. This design jointly optimizes accuracy and efficiency: on numerical claim verification benchmarks, VerifierFC achieves an 18.8% absolute accuracy gain over single-path inference while being 1.8× more computationally efficient than standard test-time computation, outperforming state-of-the-art methods. Its core contributions are the integration of multi-path reasoning, learnable path verification, and adaptive computation scheduling, which together mitigate reasoning drift in a principled, systematic manner.

📝 Abstract
Fact-checking real-world claims, particularly numerical claims, is inherently complex, requiring multistep and numerical reasoning to verify diverse aspects of a claim. Although large language models (LLMs), including reasoning models, have made tremendous advances, they still fall short on fact-checking real-world claims that require a combination of compositional and numerical reasoning. They are unable to understand the nuances of numerical aspects and are also susceptible to the reasoning drift issue, where the model fails to contextualize diverse information, resulting in misinterpretation and backtracking of the reasoning process. In this work, we systematically explore scaling test-time compute (TTS) for LLMs on the task of fact-checking complex numerical claims, which entails eliciting multiple reasoning paths from an LLM. We train a verifier model (VERIFIERFC) to navigate this space of possible reasoning paths and select one that could lead to the correct verdict. We observe that TTS helps mitigate the reasoning drift issue, leading to significant performance gains on fact-checking numerical claims. To improve compute efficiency in TTS, we introduce an adaptive mechanism that performs TTS selectively based on the perceived complexity of the claim. This approach achieves 1.8x higher efficiency than standard TTS while delivering a notable 18.8% performance improvement over single-shot claim verification methods. Our code and data can be found at https://github.com/VenkteshV/VerifierFC
Problem

Research questions and friction points this paper is trying to address.

Fact-checking numerical claims requires multistep reasoning
LLMs struggle with numerical nuance and reasoning drift
Test-time compute scaling improves verification accuracy and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time scaling elicits multiple reasoning paths
Verifier model selects correct reasoning path
Adaptive mechanism improves compute efficiency selectively
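The three innovations above can be illustrated with a minimal sketch of verifier-guided best-of-N selection with an adaptive budget. Everything here is a hypothetical stand-in, not the paper's implementation: `generate_reasoning_paths` and `verifier_score` replace the actual LLM and trained VERIFIERFC scorer, and `perceived_complexity` is a toy proxy (counting numeric tokens) for whatever complexity signal the paper uses.

```python
import random

random.seed(0)

def generate_reasoning_paths(claim, n):
    # Stand-in for sampling n chain-of-thought paths from an LLM.
    return [f"path-{i} for {claim!r}" for i in range(n)]

def verifier_score(path):
    # Stand-in for the learned verifier; here just a random score.
    return random.random()

def perceived_complexity(claim):
    # Toy proxy: count numeric tokens (e.g. "6.2%", "2") in the claim.
    return sum(tok.replace('.', '', 1).replace('%', '').isdigit()
               for tok in claim.split())

def adaptive_verify(claim, base_budget=1, max_budget=8):
    # Allocate more reasoning paths to claims judged more complex,
    # capped at max_budget; simple claims get near single-shot cost.
    budget = min(max_budget, base_budget + 2 * perceived_complexity(claim))
    paths = generate_reasoning_paths(claim, budget)
    # Verifier-guided selection: keep the highest-scoring path (best-of-N).
    best = max(paths, key=verifier_score)
    return best, budget

best, budget = adaptive_verify("Unemployment fell from 6.2% to 3.9% in 2 years")
print(budget)  # numerically dense claim receives a larger path budget
```

The design point this sketch shows is that selective scaling spends the multi-path budget only where the claim looks hard, which is how the paper reports efficiency gains over uniformly applied test-time scaling.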