🤖 AI Summary
Large language models face a trade-off between reasoning capability and instruction-following proficiency on complex tasks. Existing approaches rely on stronger external models to provide supervision signals, incurring high computational costs, poor scalability, and limited accessibility. This paper proposes a self-supervised reinforcement learning framework that, for the first time, constructs reward functions exclusively from implicit signals intrinsic to the model’s internal reasoning process—such as chain-of-thought consistency and step-wise plausibility—eliminating the need for external annotations or auxiliary strong models. Through iterative policy optimization, the method simultaneously enhances instruction comprehension and execution fidelity while preserving baseline reasoning performance. Extensive experiments demonstrate its effectiveness, scalability, and cost-efficiency across diverse reasoning architectures. The implementation and datasets are publicly released.
📝 Abstract
Reasoning models excel at complex problem solving but exhibit a concerning trade-off between reasoning capability and instruction-following ability. Existing approaches for improving instruction following rely on stronger external models, creating methodological bottlenecks and practical limitations, including increased costs and accessibility constraints. We propose a self-supervised RL framework that leverages reasoning models' own internal signals to improve instruction following without external supervision. Extensive experiments demonstrate that our framework significantly improves instruction following while maintaining reasoning performance, offering a scalable and cost-effective approach to enhancing instruction following in reasoning models. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.
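The abstract does not spell out how an internal signal becomes a reward, so here is a minimal illustrative sketch of one such signal, chain-of-thought self-consistency: score a prompt by how strongly multiple sampled completions agree on a final answer. The function name and the majority-vote scheme are assumptions for illustration, not the paper's actual reward design.

```python
from collections import Counter

def self_consistency_reward(sampled_answers):
    """Hypothetical internal-signal reward: the fraction of sampled
    chain-of-thought completions whose final answer matches the
    majority answer. Higher agreement -> higher reward, with no
    external labels or judge model involved."""
    if not sampled_answers:
        return 0.0
    counts = Counter(sampled_answers)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(sampled_answers)

# Final answers extracted from four sampled chains of thought.
answers = ["42", "42", "41", "42"]
print(self_consistency_reward(answers))  # 0.75
```

In an RL loop such a scalar could be fed to a policy-gradient update per prompt; the actual reward construction and optimization algorithm are defined in the paper and its released code.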