🤖 AI Summary
This work addresses a core difficulty for long-horizon large language model agents in sparse-reward environments: identifying and correcting the intermediate actions responsible for task failure. The authors propose HINT-SD, a targeted self-distillation framework that analyzes complete trajectories in hindsight and applies feedback-conditioned distillation only to the action spans associated with failure. Combining hindsight attribution with selective distillation yields more precise supervision and avoids generating feedback at every turn of a trajectory. On the BFCL v3 and AppWorld benchmarks, the method improves over the dense per-turn feedback baseline by up to 18.80% while cutting time per training step to approximately 44% of the baseline (2.26× lower).
📝 Abstract
Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80% while achieving 2.26× lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
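The idea of distilling only on targeted action spans can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `[start, end)` span convention, the direction of the KL term (teacher to student), and the mean reduction are all assumptions made for illustration. Here the "teacher" stands for the same model conditioned on hindsight feedback, and the loss is computed only at token positions inside the selected failure-relevant spans.

```python
import math
from typing import List, Sequence, Tuple

Distribution = Sequence[float]  # one probability vector over the vocabulary


def span_mask(num_tokens: int, spans: Sequence[Tuple[int, int]]) -> List[bool]:
    """Return a per-token mask that is True inside any [start, end) span."""
    mask = [False] * num_tokens
    for start, end in spans:
        for i in range(max(0, start), min(num_tokens, end)):
            mask[i] = True
    return mask


def kl_divergence(p: Distribution, q: Distribution, eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def targeted_distillation_loss(
    teacher_probs: Sequence[Distribution],
    student_probs: Sequence[Distribution],
    spans: Sequence[Tuple[int, int]],
) -> float:
    """Mean KL(teacher || student) over tokens inside the selected spans.

    Assumption: teacher_probs come from the feedback-conditioned pass and
    student_probs from the unconditioned policy, aligned token by token.
    Tokens outside the spans contribute nothing to the loss.
    """
    mask = span_mask(len(teacher_probs), spans)
    selected = [
        kl_divergence(t, s)
        for t, s, keep in zip(teacher_probs, student_probs, mask)
        if keep
    ]
    if not selected:
        return 0.0  # no failure-relevant span selected for this trajectory
    return sum(selected) / len(selected)


# Toy trajectory of three tokens over a two-word vocabulary.
teacher = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
student = [[0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]

targeted = targeted_distillation_loss(teacher, student, spans=[(1, 2)])
dense = targeted_distillation_loss(teacher, student, spans=[(0, 3)])
```

In this toy case, `targeted` supervises only token 1, while `dense` averages over all three positions, including ones that hindsight analysis might have judged irrelevant to the failure. How spans are actually selected from the full trajectory is the method's contribution and is not modeled here.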