Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) struggle with multi-constraint instruction following, and existing training approaches rely heavily on external supervision while suffering performance degradation from sparse rewards. To address these issues, we propose a self-supervised reinforcement learning framework that requires no human annotation. First, it autonomously generates pseudo-labels and fine-grained reward signals directly from the input instructions. Second, it decomposes multi-constraint tasks into independent subtasks via constraint decomposition. Third, it introduces a constraint-wise binary classification reward-modeling mechanism that mitigates reward sparsity and improves generalization in multi-step reasoning. By integrating instruction-driven reward derivation with self-supervised optimization, our method achieves substantial improvements over state-of-the-art approaches across three in-domain and five cross-domain benchmarks, including complex agent-behavior and multi-turn instruction-following tasks, demonstrating superior robustness and generalization.
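
To make the reward-densification idea concrete, here is a minimal sketch (not the authors' released code) contrasting an all-or-nothing reward with a constraint-wise binary reward. `check_constraint` is a hypothetical stand-in for whatever per-constraint verifier (rule-based checker or trained classifier) the method uses:

```python
from typing import Callable, List

def sparse_reward(response: str, constraints: List[str],
                  check_constraint: Callable[[str, str], bool]) -> float:
    # All-or-nothing: reward is 1 only if every constraint passes.
    # With many constraints this is almost always 0 -- the sparsity problem.
    return float(all(check_constraint(response, c) for c in constraints))

def constraint_wise_reward(response: str, constraints: List[str],
                           check_constraint: Callable[[str, str], bool]) -> float:
    # Dense alternative: score each decomposed constraint as an independent
    # binary pass/fail and average, so partial satisfaction still earns signal.
    if not constraints:
        return 0.0
    passes = [check_constraint(response, c) for c in constraints]
    return sum(passes) / len(passes)

# Toy usage with rule-based checkers (illustration only):
constraints = ["mentions 'hello'", "under 20 words"]
checkers = {
    "mentions 'hello'": lambda r: "hello" in r.lower(),
    "under 20 words": lambda r: len(r.split()) < 20,
}
response = "Hi there, reader."
print(sparse_reward(response, constraints, lambda r, c: checkers[c](r)))           # 0.0
print(constraint_wise_reward(response, constraints, lambda r, c: checkers[c](r)))  # 0.5
```

A response satisfying one of two constraints scores 0.5 instead of 0, which is the kind of gradient that keeps RL training from stalling on hard multi-constraint prompts.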

📝 Abstract
Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if.

Problem

Research questions and friction points this paper is trying to address.

Language models struggle to follow multi-constraint instructions
Existing RL approaches depend heavily on external supervision
Sparse reward signals limit performance on multi-constraint tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

A self-supervised RL framework removes the dependency on external supervision by deriving rewards and pseudo-labels from the instructions themselves (see the sketch after this list)
Constraint decomposition splits multi-constraint instructions into independently verifiable subtasks, easing reward sparsity
Constraint-wise binary classification reward modeling densifies the reward signal while remaining computationally efficient
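
As a rough illustration of the label-free idea in the first bullet, the sketch below shows one plausible way to turn sampled responses plus per-constraint verification into pseudo-labeled (response, constraint, pass/fail) triples for training a binary reward model. `generate` and `check_constraint` are assumed interfaces, not the paper's actual API:

```python
from typing import Callable, List, Tuple

def build_pseudo_labels(
    instruction: str,
    constraints: List[str],                        # decomposed from the instruction
    generate: Callable[[str], str],                # policy model sampling one response
    check_constraint: Callable[[str, str], bool],  # per-constraint verifier
    num_samples: int = 8,
) -> List[Tuple[str, str, int]]:
    # Sample several responses to the instruction, verify each decomposed
    # constraint independently, and emit (response, constraint, 0/1) triples
    # that can train a constraint-wise binary classification reward model,
    # with no human annotation in the loop.
    data: List[Tuple[str, str, int]] = []
    for _ in range(num_samples):
        response = generate(instruction)
        for c in constraints:
            data.append((response, c, int(check_constraint(response, c))))
    return data
```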
Authors

Qingyu Ren
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University
Qianyu He
Fudan University
Large Language Model, Reasoning, Instruction Following, Creative Generation
Bowei Zhang
Peking University
Jie Zeng
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University
Jiaqing Liang
Fudan University
knowledge graph, deep learning
Yanghua Xiao
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University
Weikang Zhou
Ant Group
Zeye Sun
Ant Group
Fei Yu
Ant Group