🤖 AI Summary
Large language models (LLMs) often produce responses that partially satisfy instruction constraints, leading to sparse reward signals and low sample efficiency in instruction-following reinforcement learning.
Method: We propose Hindsight Instruction Replay (HiR), a novel framework that (i) retrospectively relabels failed responses by grouping them according to the constraints they satisfy, yielding high-quality pseudo-positive samples; (ii) formulates dual-granularity preference learning objectives at both the instruction and response levels; and (iii) performs efficient optimization using only binary reward signals. HiR integrates constraint-aware filtering and rewriting, reinforcement-learning-driven sample replay, and dual preference modeling.
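The relabeling step in (i) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `verify` callable and the `Sample` container are hypothetical stand-ins for a per-constraint checker and a training record, and the rewriting here simply restates the satisfied constraints in the instruction.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

@dataclass
class Sample:
    instruction: str           # (possibly rewritten) instruction
    constraints: Tuple[str, ...]  # constraints the response satisfies
    response: str
    reward: int                # binary reward signal

def hindsight_replay(
    instruction: str,
    constraints: Sequence[str],
    response: str,
    verify: Callable[[str, str], bool],
) -> Optional[Sample]:
    """Relabel a failed response as a success for the subset of
    constraints it did satisfy (hindsight replay sketch)."""
    satisfied = tuple(c for c in constraints if verify(response, c))
    if not satisfied or len(satisfied) == len(constraints):
        # Nothing salvageable, or already a full success: no replay needed.
        return None
    # Rewrite the instruction so it only demands the satisfied constraints,
    # turning the failed attempt into a pseudo-positive sample.
    rewritten = f"{instruction} Constraints: " + "; ".join(satisfied)
    return Sample(rewritten, satisfied, response, reward=1)
```

A toy `verify` (substring matching) is enough to exercise the logic; a real system would use constraint-specific checkers.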
Results: HiR achieves significant performance gains across diverse complex instruction-following tasks. It attains comparable or superior results with over 30% reduction in sampling and computational overhead. We publicly release our code and dataset to foster reproducibility and further research.
📝 Abstract
Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight Instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction-following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that were satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction and response levels to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that HiR yields promising results across diverse instruction-following tasks while requiring a smaller computational budget. Our code and dataset are available at https://github.com/sastpg/HIR.