HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work targets long-horizon large language model agents in sparse-reward environments, where it is difficult to identify and correct the intermediate actions responsible for a task failure. The authors propose HINT-SD, a targeted self-distillation framework that analyzes complete trajectories in hindsight and applies feedback-conditioned distillation only to the action spans associated with failure. Combining hindsight attribution with selective distillation avoids generating feedback at every turn of a trajectory. On the BFCL v3 and AppWorld benchmarks, the method improves over the dense per-turn feedback baseline by up to 18.80% while cutting time per training step by 2.26×, to roughly 44% of the baseline.

📝 Abstract

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26$\times$ lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.
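The idea of distilling only on targeted action spans can be illustrated with a minimal sketch. The abstract does not specify the loss, so everything below is an assumption made for illustration: the teacher is taken to be the same model conditioned on hindsight feedback, the loss is a per-token KL(teacher || student), and the names `span_mask`, `token_kl`, and `targeted_distillation_loss` are hypothetical rather than taken from the paper. How the failure-relevant spans are chosen is outside this sketch; they are passed in as token index ranges.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical representation: a half-open [start, end) range of token indices.
Span = Tuple[int, int]


def span_mask(num_tokens: int, spans: Sequence[Span]) -> List[bool]:
    """Mark the token positions that fall inside any targeted span."""
    mask = [False] * num_tokens
    for start, end in spans:
        for i in range(max(0, start), min(num_tokens, end)):
            mask[i] = True
    return mask


def token_kl(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(teacher || student) for two distributions over the same vocabulary.

    Assumes the student assigns non-zero probability wherever the teacher does.
    """
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher, student)
        if p > 0.0
    )


def targeted_distillation_loss(
    teacher_probs: Sequence[Sequence[float]],
    student_probs: Sequence[Sequence[float]],
    spans: Sequence[Span],
) -> float:
    """Average per-token KL over targeted positions only.

    Tokens outside the spans contribute nothing, so no teacher signal
    (and no feedback) is needed for them. Returns 0.0 if nothing is targeted.
    """
    mask = span_mask(len(student_probs), spans)
    selected = [
        token_kl(t, s)
        for t, s, keep in zip(teacher_probs, student_probs, mask)
        if keep
    ]
    return sum(selected) / len(selected) if selected else 0.0


# Toy trajectory of three tokens over a two-word vocabulary.
# The assumed feedback-conditioned teacher disagrees with the student at token 1 only.
teacher = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
student = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]

loss_on_failure_span = targeted_distillation_loss(teacher, student, [(1, 2)])
loss_on_neutral_span = targeted_distillation_loss(teacher, student, [(0, 1)])
```

In this toy case the loss is positive only when the span covers the token where the teacher and student disagree, which mirrors the abstract's point that where the distillation signal is applied matters as much as the signal itself.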

Problem

Research questions and friction points this paper is trying to address.

long-horizon agents

reinforcement learning

sparse rewards

intermediate actions

feedback efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

hindsight self-distillation

long-horizon agents

targeted distillation