Zero Reinforcement Learning Towards General Domains

📅 2025-10-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing zero reinforcement learning (Zero-RL) approaches are confined to reward-verifiable domains (such as mathematics and programming) where correctness can be objectively checked, limiting their applicability to reasoning tasks with non-verifiable rewards. Method: We propose the first Zero-RL paradigm for non-verifiable domains, enabling cross-domain reasoning transfer via joint training on verifiable tasks and on non-verifiable tasks guided by a generative reward model, together with a novel smooth length penalty that mitigates reward hacking. Contribution/Results: Evaluated on the Qwen3-8B-Base and Qwen3-14B-Base models, the method significantly improves performance on complex reasoning benchmarks (e.g., MMLU, GPQA) and general-purpose tasks (e.g., BIG-Bench Hard), demonstrating strong generalization and practical efficacy in settings lacking explicit reward signals.

📝 Abstract
Zero Reinforcement Learning (Zero-RL) has proven to be an effective approach for enhancing the reasoning capabilities of large language models (LLMs) by directly applying reinforcement learning with verifiable rewards to pretrained models, without the need for a supervised fine-tuning phase. However, current research on Zero-RL primarily focuses on domains with easily verifiable reward signals, such as mathematics, programming, and other reasoning tasks. The challenge of eliciting reasoning abilities in more diverse scenarios, where verification is not straightforward, remains underexplored. To address this gap, we propose a novel Zero-RL paradigm designed to improve a model's reasoning ability across both verifiable and non-verifiable domains. By combining verifiable rewards with a generative reward model, we conduct multi-task Zero-RL training across both domains, facilitating the transfer of reasoning capabilities between them. Furthermore, to mitigate reward hacking against the generative reward model, we design a smooth length penalty that encourages the generation of more comprehensive thinking tokens in general domains. Experimental results on Qwen3-8B-Base and Qwen3-14B-Base demonstrate that our approach achieves superior reasoning performance, not only on tasks requiring extensive reasoning but also on more general tasks.
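The abstract names a "smooth length penalty" but does not give its formula. As a rough illustration only, one smooth (sigmoid-shaped) scaling term that favors longer thinking traces without a hard cutoff might look like the sketch below; `target_len` and `temperature` are hypothetical parameters, not values from the paper.

```python
import math

def smooth_length_penalty(num_tokens: int,
                          target_len: int = 1024,
                          temperature: float = 128.0) -> float:
    """Hypothetical smooth length penalty (a sketch, not the paper's
    formula): a sigmoid centered at target_len. Responses near or above
    the target score ~1.0; much shorter responses are smoothly scaled
    toward 0, discouraging degenerate short outputs without a hard cliff.
    """
    return 1.0 / (1.0 + math.exp((target_len - num_tokens) / temperature))
```

Because the scaling is smooth rather than a step function, the policy gradient still receives a useful signal for responses slightly below the target length, which is presumably why a smooth form helps against reward hacking.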
Problem

Research questions and friction points this paper is trying to address.

Extends zero-RL to domains without verifiable reward signals
Enhances reasoning transfer between verifiable and non-verifiable domains
Addresses reward hacking through smooth length penalty design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines verifiable rewards with generative reward model
Uses multi-task zero-RL training across domains
Implements smooth length penalty to prevent reward hacking
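The bullets above can be sketched as a reward-routing function for multi-task training: verifiable samples get a rule-based correctness reward, while general-domain samples get a generative reward model score scaled by a length penalty. The interface, field names, and stub rewards below are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, Dict

def compute_reward(sample: Dict[str, str],
                   verifier: Callable[[str, str], bool],
                   gen_reward_model: Callable[[str, str], float],
                   length_penalty: Callable[[int], float]) -> float:
    """Hypothetical per-sample reward for multi-task Zero-RL training.

    Verifiable domains (math/code): binary reward from a rule-based
    verifier. Non-verifiable domains: generative reward model score,
    scaled by a smooth length penalty to discourage reward hacking.
    """
    if sample["domain"] == "verifiable":
        return 1.0 if verifier(sample["prompt"], sample["response"]) else 0.0
    score = gen_reward_model(sample["prompt"], sample["response"])
    return score * length_penalty(len(sample["response"].split()))
```

Training on both branches within the same RL run is what allows reasoning behavior learned on verifiable tasks to transfer to the general-domain branch.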
Yuyuan Zeng
LLM Department, Tencent
Yufei Huang
LLM Department, Tencent
Can Xu
LLM Department, Tencent
Qingfeng Sun
Tencent Hunyuan X
Natural Language Processing
Jianfeng Yan
LLM Department, Tencent
Guanghui Xu
LLM Department, Tencent
Tao Yang
LLM Department, Tencent
Fengzong Lian
LLM Department, Tencent