Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models face a trade-off between reasoning capability and instruction-following proficiency in complex tasks. Existing approaches rely on stronger external models to provide supervision signals, incurring high computational costs, poor scalability, and limited accessibility. This paper proposes a self-supervised reinforcement learning framework that, for the first time, constructs reward functions exclusively from implicit signals intrinsic to the model’s internal reasoning process—such as chain-of-thought consistency and step-wise plausibility—eliminating the need for external annotations or auxiliary strong models. Through iterative policy optimization, our method simultaneously enhances instruction comprehension and execution fidelity while preserving baseline reasoning performance. Extensive experiments demonstrate its effectiveness, scalability, and cost-efficiency across diverse reasoning architectures. The implementation and datasets are publicly released.
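The summary describes rewards built purely from the model's own internal signals, such as chain-of-thought consistency, rather than from an external judge. As a minimal illustrative sketch (not the paper's actual implementation), one such signal could be the agreement rate across several sampled chain-of-thought rollouts for the same prompt: the more rollouts converge on the majority answer, the higher the reward, with no labels or auxiliary model required.

```python
from collections import Counter

def self_consistency_reward(sampled_answers):
    """Hypothetical internal-signal reward: the fraction of sampled
    chain-of-thought rollouts whose final answer matches the majority
    answer. Requires no external supervision, only the model's own samples."""
    if not sampled_answers:
        return 0.0
    counts = Counter(sampled_answers)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(sampled_answers)

# Example: 3 of 4 rollouts agree on "42", so the reward is 0.75.
reward = self_consistency_reward(["42", "42", "41", "42"])
```

Such a scalar could then serve as the reward in a standard policy-optimization loop; the actual signals and optimizer used by the paper may differ.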

📝 Abstract
Reasoning models excel in complex problem solving but exhibit a concerning trade-off between reasoning capabilities and instruction-following abilities. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction-following capabilities without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
Problem

Research questions and friction points this paper is trying to address.

Resolves trade-off between reasoning and instruction following
Eliminates need for costly external supervision models
Improves instruction following without sacrificing reasoning performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised RL framework
Leverages internal signals
No external supervision needed
Qingyu Ren
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University
Qianyu He
Fudan University
Large Language Model · Reasoning · Instruction Following · Creative Generation
Bowei Zhang
Peking University
Jie Zeng
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University
Jiaqing Liang
Fudan University
Knowledge Graph · Deep Learning
Yanghua Xiao
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University
Weikang Zhou
Ant Group
Zeye Sun
Ant Group
Fei Yu
Ant Group