Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training

📅 2025-10-31
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the high cost of human annotation and the training collapse caused by sparse positive trajectories when training LLM agents for long-horizon, domain-specific tasks, this paper proposes Apollo: an asynchronous human-AI collaboration framework that triggers lightweight human intervention only when the agent deviates from a promising trajectory. Apollo introduces action-level trajectory filtering and a multi-strategy trajectory optimization pipeline (integrating supervision control, behavior cloning, and outcome-driven sampling) to significantly reduce human effort while mitigating error accumulation. Evaluated on InnovatorBench by fine-tuning GLM-4.5, Apollo achieves more than a 50% performance gain over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results demonstrate Apollo's effectiveness, robustness, and scalability in professional-domain applications.

📝 Abstract
Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.
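The rollout loop the abstract describes is easy to picture. Below is a minimal sketch, assuming a generic agent/environment interface, of how an asynchronous rollout with opt-in human guidance might look; the names (`agent`, `env`, `Step`, `guidance_queue`) are hypothetical illustrations, not the paper's published API.

```python
# Minimal sketch of an asynchronous human-in-the-loop rollout, assuming a
# generic agent/environment interface. All names here (agent, env, Step,
# guidance_queue) are hypothetical; the paper does not publish this API.
import queue
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    observation: str
    action: str
    guidance: Optional[str] = None  # human advice attached to this step, if any

def rollout(agent, env, guidance_queue: queue.Queue, max_steps: int = 10_000):
    """Run the agent autonomously; fold in human guidance only when it arrives."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        # Non-blocking read: annotators intervene only when the agent drifts
        # from a promising trajectory, so most steps see an empty queue.
        try:
            guidance = guidance_queue.get_nowait()
        except queue.Empty:
            guidance = None
        action = agent.act(obs, guidance=guidance)
        trajectory.append(Step(obs, action, guidance))
        obs, done = env.step(action)
        if done:
            break
    return trajectory
```

Because the human side only ever writes to the queue, an annotator can supervise a rollout spanning the 30+ hours the paper reports without shadowing every step.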
Problem

Research questions and friction points this paper is trying to address.

Training LLM agents for long-horizon, domain-specialized tasks is challenging
Current methods are either prohibitively expensive (dense annotation) or collapse due to the rarity of valid positive trajectories (outcome-driven sampling)
Reliable data collection requires combining human guidance with action-level filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous human guidance replaces dense step-by-step annotation
Action-level filtering prevents error propagation (see the sketch after this list)
Lightweight intervention sustains agent interactions for over 30 hours
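The filtering ingredient can likewise be sketched. The snippet below is an illustrative guess at what "supervision control" could look like: a per-action judge scores each step, and sub-optimal actions are dropped before the trajectory reaches training, so one bad step does not contaminate the rest. The `judge` callable and the threshold are assumptions, not the paper's actual criterion.

```python
# Illustrative sketch of action-level trajectory filtering ("supervision
# control"). The judge callable and the 0.5 threshold are assumptions;
# the paper's actual filtering criterion may differ.
def filter_actions(trajectory, judge, threshold: float = 0.5):
    """Split a trajectory into training-worthy and discarded steps."""
    kept, dropped = [], []
    for step in trajectory:
        # Score each action in isolation so a single sub-optimal step can be
        # removed without throwing away the whole trajectory.
        if judge(step.observation, step.action) >= threshold:
            kept.append(step)
        else:
            dropped.append(step)
    return kept, dropped
```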
👥 Authors
Dayuan Fu · MS Student, Beijing University of Posts and Telecommunications · LLM Agents, post-training, Natural Language Processing
Yunze Wu · SJTU, SII, GAIR
Xiaojie Cai · SJTU, SII, GAIR
Lyumanshan Ye · Shanghai Jiao Tong University · Human-Computer Interaction
Shijie Xia · Shanghai Jiao Tong University · Natural Language Processing
Zhen Huang · SJTU, SII, GAIR
Weiye Si · SJTU, SII, GAIR
Tianze Xu · SJTU, SII, GAIR
Jie Sun · SJTU, SII, GAIR
Keyu Li · SJTU, SII, GAIR
Mohan Jiang · Shanghai Jiao Tong University · Agentic System, Multimodal Large Language Model
Junfei Wang · Postdoctoral fellow @ RISE Lab, York University · Smart Grid, Convex Optimization, Cyber Security, Data-driven Optimization
Qishuo Hua · SJTU, SII, GAIR
Pengrui Lu · SJTU, SII, GAIR
Yang Xiao · SJTU, SII, GAIR
Pengfei Liu · SJTU, SII, GAIR