CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions

šŸ“… 2025-11-30
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
To address the poor generalization and low robustness of existing answer verification methods for computationally intensive scientific reasoning—such as algebraic equivalence checking and physical constant substitution—this paper proposes a tool-augmented two-stage verification framework. In Stage I, cold-start fine-tuning establishes foundational verification capability. In Stage II, multi-turn tool invocation—integrating symbolic computation and numerical execution engines—is combined with verifiability-driven reinforcement learning and a fine-grained reward mechanism to enable automated, interpretable verification. The framework significantly improves language models’ judgment accuracy on complex scientific computation tasks, achieving state-of-the-art (SOTA) performance on VerifyBench-Hard and SCI-Bench. As a reward model, it consistently outperforms all baseline verifiers—including those grounded in scoring rubrics or other models—on AIME’24 and AIME’25.

šŸ“ Abstract
Answer verification methods are widely employed in language model training pipelines spanning data curation, evaluation, and reinforcement learning with verifiable rewards (RLVR). While prior work focuses on developing unified verifiers applicable across multiple reasoning scenarios, significant challenges remain in computation-oriented scientific domains, such as algebraic equivalence checking and physical constant substitution. In this paper, we introduce CoSineVerifier, a tool-augmented verifier that leverages external executors to perform precise computations and symbolic simplifications, enabling robust verification that goes beyond simple semantic matching. We propose a novel two-stage pipeline that begins with cold-start fine-tuning, followed by multi-turn reinforcement learning with tool integration. Extensive experiments on STEM subjects, general QA, and long-form reasoning tasks demonstrate the strong generalization of CoSineVerifier, which achieves state-of-the-art performance on VerifyBench-Hard and SCI-Bench. When employed as a reward model in RLVR, it consistently outperforms both rubric-based and model-based verifiers on AIME'24 and AIME'25, demonstrating strong potential to enhance the reasoning capabilities of LLMs. Our model is released at https://huggingface.co/Nanbeige/CoSineVerifier-Tool-4B.
Problem

Research questions and friction points this paper is trying to address.

Developing robust answer verification for computation-oriented scientific questions
Addressing algebraic equivalence checking and physical constant substitution challenges
Enhancing verification beyond simple semantic matching using external tools
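The core difficulty the bullets above describe is that two answers can be algebraically equivalent yet fail string or semantic matching. A minimal sketch of the tool-style check is a numerical equivalence test: evaluate both expressions at random points and compare within a tolerance. This is an illustrative stand-in only (the function name, trial count, and tolerances are our assumptions); the paper's verifier instead invokes external symbolic and numerical execution engines.

```python
import math
import random

def numerically_equivalent(expr_a, expr_b, var="x", trials=20, tol=1e-9):
    """Hypothetical helper: test two single-variable expressions for
    equivalence by evaluating both at random sample points.
    A crude stand-in for the symbolic/numerical tools a verifier calls."""
    env = {"math": math}
    for _ in range(trials):
        v = random.uniform(-10.0, 10.0)
        try:
            a = eval(expr_a, {**env, var: v})
            b = eval(expr_b, {**env, var: v})
        except (ValueError, ZeroDivisionError):
            continue  # sample point outside the shared domain; resample
        if not math.isclose(a, b, rel_tol=tol, abs_tol=tol):
            return False
    return True

# Equivalent forms that plain string matching would reject:
print(numerically_equivalent("2*(x + 1)", "2*x + 2"))         # True
print(numerically_equivalent("(x**2 - 1)/(x - 1)", "x + 1"))  # True
print(numerically_equivalent("x**2", "x**3"))                 # False
```

A symbolic engine (e.g. a CAS simplifying the difference of the two expressions to zero) gives an exact rather than probabilistic answer, which is why tool augmentation matters for these domains.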
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-augmented verifier using external executors for precise computations
Two-stage pipeline with cold-start fine-tuning and multi-turn reinforcement learning
Employed as reward model in reinforcement learning to enhance reasoning
šŸ‘„ Authors
Ruixiang Feng (University of Electronic Science and Technology of China, Chengdu, China)
Zhenwei An (Nanbeige Lab, BOSS Zhipin)
Yuntao Wen (University of Electronic Science and Technology of China, Chengdu, China)
Ran Le (Nanbeige Lab, BOSS Zhipin)
Yiming Jia (Nanbeige Lab, BOSS Zhipin)
Chen Yang (Nanbeige Lab, BOSS Zhipin)
Zongchao Chen (Nanbeige Lab, BOSS Zhipin)
Lisi Chen (University of Electronic Science and Technology of China, Chengdu, China)
Shen Gao (University of Electronic Science and Technology of China, Chengdu, China)
Shuo Shang
Yang Song (Nanbeige Lab, BOSS Zhipin)
Tao Zhang (Nanbeige Lab, BOSS Zhipin)