SCI-Verifier: Scientific Verifier with Thinking

📅 2025-09-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Scientific large language models (LLMs) face critical challenges in answer verification, including complex output formats, diverse semantically equivalent expressions, overreliance on manual rules or prompt engineering in existing methods, and poor cross-disciplinary generalization. Method: This paper introduces SCI-VerifyBench, the first multi-disciplinary benchmark for verifying real-world scientific question answering, and proposes SCI-Verifier, a unified reasoning-based verification model. The benchmark is built from real LLM responses, enriched with domain-specific equivalence transformations, and annotated collaboratively by models and experts; the verifier is enhanced through post-training to jointly support logical inference and semantic equivalence assessment. Contribution/Results: Experiments demonstrate that SCI-Verifier significantly improves verification accuracy and robustness across diverse scientific disciplines while yielding concise and stable outputs. It overcomes the dual limitations of prior approaches, limited generalizability and insufficient formal rigor, establishing a new state of the art in scientific answer verification.

📝 Abstract
As large language models (LLMs) are increasingly applied to scientific reasoning, the complexity of answer formats and the diversity of equivalent expressions make answer verification a critical yet challenging task. Existing verification studies in scientific domains suffer from two major limitations: (a) the absence of systematic evaluation standards and insufficient disciplinary coverage, which hinders their comprehensive assessment; and (b) heavy reliance on cumbersome rule design or prompt engineering, which reduces their effectiveness in complex reasoning scenarios or limits their cross-disciplinary generalization. To address these challenges, we propose solutions at both the data and model levels. On the data side, we construct SCI-VerifyBench, a cross-disciplinary benchmark covering mathematics, physics, biology, chemistry, and general scientific QA. The benchmark is built from real LLM responses and enhanced with domain-specific equivalence transformations that generate challenging and realistic data. Model-based and expert annotations ensure both quality and diversity, enabling rigorous evaluation of verification ability. On the model side, we emphasize the importance of reasoning for verification and introduce SCI-Verifier, a unified reasoning-augmented verifier for scientific domains. Through post-training, SCI-Verifier demonstrates strong logical reasoning and equivalence judgment capabilities while maintaining concise and stable outputs. Together, SCI-VerifyBench and SCI-Verifier provide a principled framework for scientific verification, offering both systematic evaluation and practical pathways to enhance the reliability and applicability of LLMs in scientific domains.
Problem

Research questions and friction points this paper is trying to address.

Verifying diverse scientific answers from language models
Addressing limitations in existing verification methods and standards
Enhancing reliability of scientific reasoning through improved verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-disciplinary benchmark with domain-specific equivalence transformations
Reasoning-augmented verifier through post-training for logical equivalence
Unified framework combining systematic evaluation and practical verification
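To illustrate the brittleness of the rule-based verification the paper argues against, here is a toy answer matcher, a minimal sketch of my own (not the paper's method), assuming only string normalization and numeric equivalence rules. It correctly equates "1/2" with "0.5" but misses symbolic equivalences such as "x + x" versus "2x", the kind of semantic judgment a reasoning-augmented verifier like SCI-Verifier is meant to handle.

```python
from fractions import Fraction

def rule_based_match(pred: str, gold: str) -> bool:
    """Toy rule-based answer verifier (illustrative only).

    Rule 1: case- and whitespace-insensitive exact match.
    Rule 2: numeric equivalence via exact rational arithmetic.
    Anything requiring symbolic or semantic reasoning falls through.
    """
    p, g = pred.strip().lower(), gold.strip().lower()
    if p == g:
        return True
    try:
        # Fraction parses both decimal ("0.5") and rational ("1/2") strings,
        # so numerically equal answers compare equal without float error.
        return Fraction(p) == Fraction(g)
    except ValueError:
        # Non-numeric answers ("x + x" vs "2x", prose, units) are beyond
        # these rules -- the failure mode the paper's verifier targets.
        return False
```

Each new answer format or discipline would require another hand-written rule, which is exactly the cumbersome rule design and poor cross-disciplinary generalization the paper identifies.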