SEIF: Self-Evolving Reinforcement Learning for Instruction Following

📅 2026-05-08
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing approaches to improving instruction following in large language models rely either on costly external supervision (human annotators or strong teacher models) or on self-play with static-difficulty instructions, which limits how far the model can keep improving. This work proposes SEIF, a self-evolving reinforcement learning framework that needs no external supervision. It sets up a closed loop among four roles (Instructor, Filter, Follower, and Judger) so that instruction difficulty and model capability co-evolve. By combining adversarial instruction generation, filtering of conflicting or invalid instructions, and reward-based reinforcement learning, the approach improves instruction-following performance across multiple model scales and architectures.
📝 Abstract
Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training with static-difficulty instructions that cannot evolve as the model's capabilities improve. To address these limitations, we propose SEIF (Self-Evolving Reinforcement Learning for Instruction Following), a self-evolving framework for enhancing the instruction-following ability of LLMs. SEIF forms a closed self-evolution loop that improves the model's instruction-following ability, where instruction difficulty evolution and model capability evolution reinforce each other. SEIF consists of four roles: an Instructor that generates increasingly challenging instructions, a Filter that removes conflicting or invalid instructions to ensure data quality, a Follower that learns to follow evolved instructions, and a Judger that provides reward signals for reinforcement learning. The Instructor and Follower are alternately trained and co-evolve throughout the process. Experiments across multiple model scales and architectures show that SEIF consistently improves instruction-following performance, suggesting strong generality. Further analyses reveal the sources of improvement and identify an effective training strategy for self-evolution on open-ended tasks: sufficient early-stage training to build a solid foundation, followed by moderate late-stage training to mitigate overfitting and achieve better final performance. The code and data are publicly available at https://github.com/Rainier-rq1/SEIF.
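The closed loop described in the abstract can be illustrated with a toy simulation. This is only a minimal sketch under loud assumptions, not the authors' implementation: instruction difficulty and Follower skill are reduced to single numbers, the Judger is a threshold check, and the "training" steps are simple arithmetic updates rather than reinforcement learning on an LLM. Every name here (Instruction, Instructor, Follower, filter_instructions, judge, self_evolve) is hypothetical and does not come from the SEIF repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instruction:
    difficulty: float   # toy scalar standing in for how hard the instruction is
    valid: bool = True  # False models a conflicting or invalid instruction


class Instructor:
    """Hypothetical stand-in for the role that generates harder instructions."""

    def __init__(self, difficulty: float = 1.0) -> None:
        self.difficulty = difficulty

    def generate(self, n: int) -> List[Instruction]:
        # Every third instruction is marked invalid so the Filter has work to do.
        return [Instruction(self.difficulty + 0.1 * i, valid=(i % 3 != 2))
                for i in range(n)]

    def update(self, mean_reward: float, step: float = 0.5) -> None:
        # Assumed adversarial pressure: the easier the batch was for the
        # Follower, the more the Instructor raises difficulty.
        self.difficulty += step * mean_reward


class Follower:
    """Hypothetical stand-in for the model that learns to follow instructions."""

    def __init__(self, skill: float = 1.0) -> None:
        self.skill = skill

    def update(self, mean_reward: float, lr: float = 0.5) -> None:
        # Placeholder for an RL policy update: improve in proportion to failures.
        self.skill += lr * (1.0 - mean_reward)


def filter_instructions(batch: List[Instruction]) -> List[Instruction]:
    """Filter role: drop conflicting or invalid instructions."""
    return [ins for ins in batch if ins.valid]


def judge(follower: Follower, ins: Instruction) -> float:
    """Judger role: reward 1.0 if the Follower can handle the instruction."""
    return 1.0 if follower.skill >= ins.difficulty else 0.0


def mean_reward(follower: Follower, batch: List[Instruction]) -> float:
    return sum(judge(follower, ins) for ins in batch) / len(batch)


def self_evolve(rounds: int, batch_size: int = 6) -> List[Tuple[float, float]]:
    """Alternate Follower and Instructor updates; return (difficulty, skill) per round."""
    instructor, follower = Instructor(), Follower()
    history = []
    for _ in range(rounds):
        batch = filter_instructions(instructor.generate(batch_size))
        # Follower phase: learn from the Judger's rewards on the filtered batch.
        follower.update(mean_reward(follower, batch))
        # Instructor phase: re-score with the updated Follower, then get harder.
        instructor.update(mean_reward(follower, batch))
        history.append((instructor.difficulty, follower.skill))
    return history


if __name__ == "__main__":
    for difficulty, skill in self_evolve(rounds=3):
        print(f"difficulty={difficulty:.3f} skill={skill:.3f}")
```

In this toy setting both numbers rise together from round to round, which mirrors the abstract's point that difficulty evolution and capability evolution reinforce each other. The scalar updates are a design shortcut for readability; the actual framework trains the Instructor and Follower as language models.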
Problem

Research questions and friction points this paper is trying to address.

instruction following
large language models
self-evolution
reinforcement learning
continuous improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving
reinforcement learning
instruction following
large language models
curriculum learning
Qingyu Ren
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University
Qianyu He
Fudan University
Large Language Model, Reasoning, Instruction Following, Creative Generation
Jiajie Zhu
Macquarie University
recommender systems, cross-domain recommendation
Xingzhou Chen
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University
Jingwen Chang
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University
Zeye Sun
Ant Group
Han Xia
Ant Group
Fei Yu
Ant Group
Jiaqing Liang
Fudan University
knowledge graph, deep learning
Yanghua Xiao
Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University