Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following

📅 2025-12-29
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) often produce responses that only partially satisfy instruction constraints, leading to sparse reward signals and low sample efficiency in instruction-following reinforcement learning. Method: We propose Hindsight Instruction Replay (HiR), a novel framework that (i) retrospectively relabels failed responses by grouping them according to the constraints they satisfied, generating high-quality pseudo-positive samples; (ii) formulates dual-granularity preference learning objectives at both the instruction level and the response level; and (iii) performs efficient optimization using only binary reward signals. HiR integrates constraint-aware filtering and rewriting, RL-driven sample replay, and dual preference modeling. Results: HiR achieves significant performance gains across diverse complex instruction-following tasks, attaining comparable or superior results with over 30% less sampling and computational overhead. We publicly release our code and dataset to foster reproducibility and further research.
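To make the select-then-rewrite step concrete, here is a minimal sketch of hindsight replay, assuming each constraint ships with a programmatic binary checker. Every name below (`Constraint`, `rewrite_instruction`, `hindsight_replay`) is an illustrative stand-in rather than the paper's actual API; see the linked repository for the real implementation.

```python
# Minimal sketch of HiR-style select-then-rewrite replay (illustrative only).
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Constraint:
    name: str                     # e.g. "response is under 50 words"
    check: Callable[[str], bool]  # binary verifier over the response text

def rewrite_instruction(task: str, kept: list[Constraint]) -> str:
    """Toy template rewrite: restate the task with only the satisfied constraints."""
    return f"{task} Requirements: " + "; ".join(c.name for c in kept) + "."

def hindsight_replay(task: str, constraints: list[Constraint],
                     response: str) -> Optional[tuple[str, str]]:
    """Relabel a failed rollout as a success for the constraints it did satisfy."""
    satisfied = [c for c in constraints if c.check(response)]
    # Skip total failures (nothing to relabel) and full successes
    # (already a valid positive sample under the original instruction).
    if not satisfied or len(satisfied) == len(constraints):
        return None
    return rewrite_instruction(task, satisfied), response
```

For example, a response that respects a length limit but misses a required keyword would be replayed as a success for a rewritten instruction that asks only for the length limit.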

๐Ÿ“ Abstract
Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight Instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction-following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that were satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction and response level to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction-following tasks while requiring a smaller computational budget. Our code and dataset are available at https://github.com/sastpg/HIR.
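The abstract names the dual-preference framing without spelling out a loss, so the following is a hedged sketch assuming a DPO-style pairwise objective at each granularity; `beta`, `lam`, the pairing scheme, and all function names are assumptions, and the paper's actual objective may differ.

```python
# Hedged sketch of a dual-granularity preference objective (assumed DPO-style).
# Inputs are summed token log-probabilities of the policy and a frozen
# reference model on (instruction, response) pairs.
import torch
import torch.nn.functional as F

def preference_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Pairwise logistic loss pushing the policy to prefer the 'winner'."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin)

def dual_preference_loss(instr_pair, resp_pair, lam=0.5):
    """Instruction-level term: a replayed response should be more likely under
    its rewritten (matched) instruction than under the original one.
    Response-level term: under one instruction, a satisfying response should
    be preferred over a failed one. `lam` balances the two granularities.
    Each pair is (logp_w, logp_l, ref_logp_w, ref_logp_l)."""
    return preference_term(*instr_pair) + lam * preference_term(*resp_pair)
```

In this reading, both pairs are constructed from binary success checks alone, which matches the abstract's claim of optimizing with only a binary reward signal.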
Problem

Research questions and friction points this paper is trying to address.

Improves RL for aligning LLMs with instruction constraints
Addresses the sparse rewards that arise because the initial model rarely satisfies every constraint at once (a toy illustration follows this list)
Enhances sample efficiency in complex instruction-following tasks
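As a toy illustration of the sparsity problem (an assumption for illustration, not a result from the paper): if each of k constraints were satisfied independently with probability p, an all-or-nothing binary reward fires with probability p^k, which collapses quickly as k grows.

```python
# Toy illustration (not from the paper): with k independent constraints each
# satisfied with probability p, an all-or-nothing reward fires with p ** k.
p = 0.7
for k in (1, 3, 5, 10):
    print(f"{k:>2} constraints -> P(reward = 1) = {p ** k:.3f}")
# 10 constraints -> P(reward = 1) = 0.028: almost every rollout earns zero reward.
```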
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replay failed attempts as successes via hindsight
Select-then-rewrite strategy for sample efficiency
Dual-preference learning with binary reward signal
🔎 Similar Papers
No similar papers found.
Kongcheng Zhang
Zhejiang University
Qi Yao
Cainiao Network
Shunyu Liu
Nanyang Technological University
Multi-Agent Learning · Reinforcement Learning · Large Language Models · Power System Control
Wenjian Zhang
Dalian University of Technology
Min Cen
University of Science and Technology of China
Yang Zhou
Zhejiang University
Wenkai Fang
Zhejiang University
Yiru Zhao
Alibaba DAMO Academy
Computer Vision
Baisheng Lai
Chinese Academy of Sciences
Mingli Song
Zhejiang University