Reinforcing General Reasoning without Verifiers

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing verifier-based reinforcement learning (RL) paradigms for large language models (LLMs) rely on rule-based answer verification, which limits generalization to open-domain tasks such as medicine, law, and engineering; using an LLM as an external verifier instead introduces strong model dependency, vulnerability to reward hacking, and substantial GPU memory overhead. Method: VeriFree, a verifier-free, single-model RL training framework that unifies policy optimization and implicit verification within one model. Inspired by DeepSeek-R1-Zero, it directly maximizes the generation probability of the reference answer and formalizes the implicit verifier via variational optimization. Contribution/Results: On MMLU-Pro, GPQA, SuperGPQA, and mathematical reasoning benchmarks, VeriFree matches or surpasses verifier-based methods while reducing GPU memory consumption by over 40%, improving both training efficiency and cross-domain generalization.

📝 Abstract
The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
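The core objective described above ("directly maximize the probability of generating the reference answer") can be written out as a sketch. The notation below is assumed for illustration, not taken verbatim from the paper: $x$ is a question with reference answer $y^{*}$, and $z$ is a reasoning trace sampled from the policy $\pi_{\theta}$.

```latex
% Verifier-free objective (sketch): instead of a binary verifier reward
% 1[y = y*], score a sampled trace z by the policy's own probability of
% the reference answer.
\[
\max_{\theta}\;
\mathbb{E}_{(x,\,y^{*}) \sim \mathcal{D}}\;
\mathbb{E}_{z \sim \pi_{\theta}(\cdot \mid x)}
\big[\, \pi_{\theta}(y^{*} \mid x, z) \,\big]
\]
```

Under this reading, the inner expectation marginalizes over reasoning traces, which is what connects the method to the variational-optimization view mentioned in the abstract.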
Problem

Research questions and friction points this paper is trying to address.

Extends RL training to general reasoning without verifiers
Addresses limitations of rule-based answer verification
Reduces reliance on strong verifier LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Verifier-free RL for general reasoning tasks
Directly maximizes reference answer probability
Unifies policy and implicit verifier training
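The bullets above can be made concrete with a toy sketch. Everything here is invented for illustration: the lookup-table "model", the prompt, and the token names stand in for an LLM policy; the real method would use this reference-answer probability as the training signal in place of a rule-based verifier.

```python
import math

# Hypothetical next-token model: probability of each token given a context.
# A stand-in for an LLM policy, purely for illustration.
TOKEN_PROBS = {
    "Q: 2+2 = ?": {"let's": 0.9, "4": 0.1},
    "Q: 2+2 = ? let's": {"think": 1.0},
    "Q: 2+2 = ? let's think": {"4": 0.8, "5": 0.2},
}

def answer_prob(prompt, reasoning, answer_tokens):
    """p(answer | prompt, reasoning) under the toy model."""
    context = f"{prompt} {reasoning}" if reasoning else prompt
    prob = 1.0
    for tok in answer_tokens:
        prob *= TOKEN_PROBS[context].get(tok, 0.0)
        context = f"{context} {tok}"
    return prob

# Verifier-free "reward" for a sampled reasoning trace: no string matching
# against the reference answer, just the model's own likelihood of it.
reward = answer_prob("Q: 2+2 = ?", "let's think", ["4"])
loss = -math.log(reward)  # minimized <=> reference-answer probability maximized
print(round(reward, 3))  # 0.8
```

The contrast with verifier-based training is the reward: a verifier would return 1 if the generated answer string matched "4" and 0 otherwise, whereas here the signal is the continuous probability the policy itself assigns to the reference answer.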
👥 Authors
Xiangxin Zhou, unknown affiliation
Zichen Liu, National University of Singapore
Anya Sims, University of Oxford (Reinforcement Learning, Deep Learning)
Haonan Wang, Sea AI Lab, Singapore
Tianyu Pang, Sea AI Lab, Singapore
Chongxuan Li, Associate Professor, Renmin University of China (Machine Learning, Generative Models, Deep Learning)
Liang Wang, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences
Min Lin, Principal Research Scientist, Sea AI Lab (Artificial Intelligence)
Chao Du, Sea AI Lab, Singapore