🤖 AI Summary
Large language models (LLMs) face significant challenges in multi-constraint instruction-following tasks, including heavy reliance on external supervision and performance degradation due to sparse rewards. To address these issues, we propose a self-supervised reinforcement learning framework that requires no human annotation. First, it autonomously generates pseudo-labels and fine-grained reward signals from input instructions. Second, it decomposes multi-constraint tasks into independent subtasks via constraint decomposition. Third, it introduces a constraint-wise binary classification reward modeling mechanism to mitigate reward sparsity and enhance generalization in multi-step reasoning. By integrating instruction-driven reward derivation with self-supervised optimization, our method achieves substantial improvements over state-of-the-art approaches across three in-domain and five cross-domain benchmarks—including complex agent behavior and multi-turn instruction-following tasks—demonstrating superior robustness and generalization capability.
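The constraint-wise binary classification reward described above can be illustrated with a minimal sketch. All names here are hypothetical (the summary does not specify an implementation): each decomposed constraint is treated as an independent binary check, and averaging the per-constraint scores yields a dense partial-credit reward, in contrast to a sparse all-or-nothing reward that fires only when every constraint holds.

```python
# Hypothetical sketch of constraint-wise binary reward aggregation,
# assuming each decomposed constraint can be expressed as a 0/1 check.

def constraint_rewards(response, checks):
    """Per-constraint binary scores for a model response."""
    return [1.0 if check(response) else 0.0 for check in checks]

def dense_reward(response, checks):
    """Mean of per-constraint scores: partial credit mitigates reward sparsity."""
    scores = constraint_rewards(response, checks)
    return sum(scores) / len(scores)

def sparse_reward(response, checks):
    """All-or-nothing baseline: 1.0 only if every constraint is satisfied."""
    return float(all(check(response) for check in checks))

# Example decomposition of the (hypothetical) instruction:
# "Answer in under 10 words, in English, ending with a period."
checks = [
    lambda r: len(r.split()) < 10,      # length constraint
    lambda r: r.isascii(),              # crude English/ASCII proxy
    lambda r: r.strip().endswith("."),  # punctuation constraint
]

print(dense_reward("The sky is blue", checks))   # 2 of 3 constraints met
print(sparse_reward("The sky is blue", checks))  # fails the all-or-nothing check
```

A response satisfying two of three constraints receives a dense reward of 2/3 rather than 0, which is the mechanism the summary credits with mitigating reward sparsity during RL training.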
📝 Abstract
Language models often struggle to follow the multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependence on external supervision and from the sparse reward signals of multi-constraint tasks. We propose a label-free self-supervised RL framework that removes this dependence by deriving reward signals directly from instructions and generating pseudo-labels for reward-model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse rewards while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if