🤖 AI Summary
Large language models face a trade-off between reasoning capability and instruction-following proficiency on complex tasks. Existing approaches rely on stronger external models to provide supervision signals, incurring high computational costs, poor scalability, and limited accessibility. This paper proposes a self-supervised reinforcement learning framework that, for the first time, constructs reward functions exclusively from implicit signals intrinsic to the model’s internal reasoning process—such as chain-of-thought consistency and step-wise plausibility—eliminating the need for external annotations or auxiliary strong models. Through iterative policy optimization, the method simultaneously enhances instruction comprehension and execution fidelity while preserving baseline reasoning performance. Extensive experiments demonstrate its effectiveness, scalability, and cost-efficiency across diverse reasoning architectures. The implementation and datasets are publicly released.
📝 Abstract
Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capability and instruction-following ability. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction following without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
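The abstract does not spell out how an internal signal becomes a reward, so here is a minimal illustrative sketch of one such signal, chain-of-thought self-consistency: score a prompt by how strongly multiple sampled completions agree on a final answer. The function name and the majority-vote scheme are assumptions for illustration, not the paper's actual reward design.

```python
from collections import Counter

def self_consistency_reward(sampled_answers):
    """Hypothetical internal-signal reward: the fraction of sampled
    chain-of-thought completions whose final answer matches the
    majority answer. Higher agreement -> higher reward, with no
    external labels or judge model involved."""
    if not sampled_answers:
        return 0.0
    counts = Counter(sampled_answers)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(sampled_answers)

# Final answers extracted from four sampled chains of thought.
answers = ["42", "42", "41", "42"]
print(self_consistency_reward(answers))  # 0.75
```

In an RL loop such a scalar could be fed to a policy-gradient update per prompt; the actual reward construction and optimization algorithm are defined in the paper and its released code.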