CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions

šŸ“… 2025-11-30
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
To address the poor generalization and low robustness of existing answer verification methods for computationally intensive scientific reasoning—such as algebraic equivalence checking and physical constant substitution—this paper proposes a tool-augmented two-stage verification framework. In Stage I, cold-start fine-tuning establishes foundational verification capability. In Stage II, multi-turn tool invocation—integrating symbolic computation and numerical execution engines—is combined with verifiability-driven reinforcement learning and a fine-grained reward mechanism to enable automated, interpretable verification. The framework significantly improves language models’ judgment accuracy on complex scientific computation tasks, achieving state-of-the-art (SOTA) performance on VerifyBench-Hard and SCI-Bench. As a reward model, it consistently outperforms all baseline verifiers—including those grounded in scoring rubrics or other models—on AIME’24 and AIME’25.

šŸ“ Abstract
Answer verification methods are widely employed in language model training pipelines spanning data curation, evaluation, and reinforcement learning with verifiable rewards (RLVR). While prior work focuses on developing unified verifiers applicable across multiple reasoning scenarios, significant challenges remain in computation-oriented scientific domains, such as algebraic equivalence checking and physical constant substitution. In this paper, we introduce CoSineVerifier, a tool-augmented verifier that leverages external executors to perform precise computations and symbolic simplifications, enabling robust verification that goes beyond simple semantic matching. We propose a novel two-stage pipeline that begins with cold-start fine-tuning, followed by multi-turn reinforcement learning with tool integration. Extensive experiments on STEM subjects, general QA, and long-form reasoning tasks demonstrate the strong generalization of CoSineVerifier, which achieves state-of-the-art performance on VerifyBench-Hard and SCI-Bench. When employed as a reward model in RLVR, it consistently outperforms both rubric-based and model-based verifiers on AIME'24 and AIME'25, demonstrating strong potential to enhance the reasoning capabilities of LLMs. Our model is released at https://huggingface.co/Nanbeige/CoSineVerifier-Tool-4B.
Problem

Research questions and friction points this paper is trying to address.

Developing robust answer verification for computation-oriented scientific questions
Addressing algebraic equivalence checking and physical constant substitution challenges
Enhancing verification beyond simple semantic matching using external tools
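The core difficulty the bullets above describe is that two answers can be algebraically equivalent yet fail string or semantic matching. A minimal sketch of the tool-style check is a numerical equivalence test: evaluate both expressions at random points and compare within a tolerance. This is an illustrative stand-in only (the function name, trial count, and tolerances are our assumptions); the paper's verifier instead invokes external symbolic and numerical execution engines.

```python
import math
import random

def numerically_equivalent(expr_a, expr_b, var="x", trials=20, tol=1e-9):
    """Hypothetical helper: test two single-variable expressions for
    equivalence by evaluating both at random sample points.
    A crude stand-in for the symbolic/numerical tools a verifier calls."""
    env = {"math": math}
    for _ in range(trials):
        v = random.uniform(-10.0, 10.0)
        try:
            a = eval(expr_a, {**env, var: v})
            b = eval(expr_b, {**env, var: v})
        except (ValueError, ZeroDivisionError):
            continue  # sample point outside the shared domain; resample
        if not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
            return False
    return True

# Equivalent forms that plain string matching would reject:
print(numerically_equivalent("2*(x + 1)", "2*x + 2"))         # True
print(numerically_equivalent("(x**2 - 1)/(x - 1)", "x + 1"))  # True
print(numerically_equivalent("x**2", "x**3"))                 # False
```

A symbolic engine (e.g. a CAS simplifying the difference of the two expressions to zero) gives an exact rather than probabilistic answer, which is why tool augmentation matters for these domains.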
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-augmented verifier using external executors for precise computations
Two-stage pipeline with cold-start fine-tuning and multi-turn reinforcement learning
Employed as reward model in reinforcement learning to enhance reasoning
šŸ‘„ Authors
Ruixiang Feng (University of Electronic Science and Technology of China, Chengdu, China)
Zhenwei An (Nanbeige Lab, BOSS Zhipin)
Yuntao Wen (University of Electronic Science and Technology of China, Chengdu, China)
Ran Le (Nanbeige Lab, BOSS Zhipin)
Yiming Jia (Nanbeige Lab, BOSS Zhipin)
Chen Yang (Nanbeige Lab, BOSS Zhipin)
Zongchao Chen (Nanbeige Lab, BOSS Zhipin)
Lisi Chen (University of Electronic Science and Technology of China, Chengdu, China)
Shen Gao (University of Electronic Science and Technology of China, Chengdu, China)
Shuo Shang
Yang Song (Nanbeige Lab, BOSS Zhipin)
Tao Zhang (Nanbeige Lab, BOSS Zhipin)