🤖 AI Summary
Large language models (LLMs) often produce responses that partially satisfy instruction constraints, leading to sparse reward signals and low sample efficiency in instruction-following reinforcement learning.
Method: We propose Hindsight Instruction Replay (HiR), a novel framework that (i) retrospectively relabels failed responses by grouping them according to the constraints they satisfy, yielding high-quality pseudo-positive samples; (ii) formulates dual-granularity preference learning objectives at both the instruction and response levels; and (iii) performs efficient optimization using only binary reward signals. HiR integrates constraint-aware filtering and rewriting, reinforcement-learning-driven sample replay, and dual preference modeling.
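The relabeling step in (i) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `verify` callable and the `Sample` container are hypothetical stand-ins for a per-constraint checker and a training record, and the rewriting here simply restates the satisfied constraints in the instruction.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

@dataclass
class Sample:
    instruction: str           # (possibly rewritten) instruction
    constraints: Tuple[str, ...]  # constraints the response satisfies
    response: str
    reward: int                # binary reward signal

def hindsight_replay(
    instruction: str,
    constraints: Sequence[str],
    response: str,
    verify: Callable[[str, str], bool],
) -> Optional[Sample]:
    """Relabel a failed response as a success for the subset of
    constraints it did satisfy (hindsight replay sketch)."""
    satisfied = tuple(c for c in constraints if verify(response, c))
    if not satisfied or len(satisfied) == len(constraints):
        # Nothing salvageable, or already a full success: no replay needed.
        return None
    # Rewrite the instruction so it only demands the satisfied constraints,
    # turning the failed attempt into a pseudo-positive sample.
    rewritten = f"{instruction} Constraints: " + "; ".join(satisfied)
    return Sample(rewritten, satisfied, response, reward=1)
```

A toy `verify` (substring matching) is enough to exercise the logic; a real system would use constraint-specific checkers.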
Results: HiR achieves significant performance gains across diverse complex instruction-following tasks. It attains comparable or superior results with over 30% reduction in sampling and computational overhead. We publicly release our code and dataset to foster reproducibility and further research.
📝 Abstract
Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight Instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction-following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that were satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction and response levels to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that HiR yields promising results across diverse instruction-following tasks while requiring a smaller computational budget. Our code and dataset are available at https://github.com/sastpg/HIR.