Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training

📅 2025-10-31
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address the high cost of human annotation and the training collapse caused by sparse positive trajectories when training LLM agents for long-horizon, domain-specific tasks, this paper proposes Apollo: an asynchronous human-AI collaboration framework that triggers lightweight human intervention only when the agent deviates from a promising trajectory. Apollo introduces action-level trajectory filtering and a multi-strategy trajectory optimization pipeline (integrating supervision control, behavior cloning, and outcome-driven sampling) to significantly reduce human effort while mitigating error accumulation. Evaluated on InnovatorBench by fine-tuning GLM-4.5, Apollo achieves more than a 50% performance gain over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results demonstrate Apollo's effectiveness, robustness, and scalability in professional-domain applications.

📝 Abstract
Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo's design in handling long-horizon, domain-specialized tasks.
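The rollout loop the abstract describes is easy to picture. Below is a minimal sketch, assuming a generic agent/environment interface, of how an asynchronous rollout with opt-in human guidance might look; the names (`agent`, `env`, `Step`, `guidance_queue`) are hypothetical illustrations, not the paper's published API.

```python
# Minimal sketch of an asynchronous human-in-the-loop rollout, assuming a
# generic agent/environment interface. All names here (agent, env, Step,
# guidance_queue) are hypothetical; the paper does not publish this API.
import queue
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    observation: str
    action: str
    guidance: Optional[str] = None  # human advice attached to this step, if any

def rollout(agent, env, guidance_queue: queue.Queue, max_steps: int = 10_000):
    """Run the agent autonomously; fold in human guidance only when it arrives."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        # Non-blocking read: annotators intervene only when the agent drifts
        # from a promising trajectory, so most steps see an empty queue.
        try:
            guidance = guidance_queue.get_nowait()
        except queue.Empty:
            guidance = None
        action = agent.act(obs, guidance=guidance)
        trajectory.append(Step(obs, action, guidance))
        obs, done = env.step(action)
        if done:
            break
    return trajectory
```

Because the human side only ever writes to the queue, an annotator can supervise a rollout spanning the 30+ hours the paper reports without shadowing every step.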
Problem

Research questions and friction points this paper is trying to address.

Training LLM agents for long-horizon, domain-specialized tasks is challenging
Current methods are either prohibitively expensive (dense annotation) or collapse due to the rarity of valid positive trajectories (outcome-driven sampling)
Reliable data collection requires combining human guidance with action-level filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous human guidance replaces dense step-by-step annotation
Action-level filtering prevents error propagation (see the sketch after this list)
Lightweight intervention sustains agent interactions for over 30 hours
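The filtering ingredient can likewise be sketched. The snippet below is an illustrative guess at what "supervision control" could look like: a per-action judge scores each step, and sub-optimal actions are dropped before the trajectory reaches training, so one bad step does not contaminate the rest. The `judge` callable and the threshold are assumptions, not the paper's actual criterion.

```python
# Illustrative sketch of action-level trajectory filtering ("supervision
# control"). The judge callable and the 0.5 threshold are assumptions;
# the paper's actual filtering criterion may differ.
def filter_actions(trajectory, judge, threshold: float = 0.5):
    """Split a trajectory into training-worthy and discarded steps."""
    kept, dropped = [], []
    for step in trajectory:
        # Score each action in isolation so a single sub-optimal step can be
        # removed without throwing away the whole trajectory.
        if judge(step.observation, step.action) >= threshold:
            kept.append(step)
        else:
            dropped.append(step)
    return kept, dropped
```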
👥 Authors
Dayuan Fu · MS Student, Beijing University of Posts and Telecommunications · LLM Agents, post-training, Natural Language Processing
Yunze Wu · SJTU, SII, GAIR
Xiaojie Cai · SJTU, SII, GAIR
Lyumanshan Ye · Shanghai Jiao Tong University · Human-Computer Interaction
Shijie Xia · Shanghai Jiao Tong University · Natural Language Processing
Zhen Huang · SJTU, SII, GAIR
Weiye Si · SJTU, SII, GAIR
Tianze Xu · SJTU, SII, GAIR
Jie Sun · SJTU, SII, GAIR
Keyu Li · SJTU, SII, GAIR
Mohan Jiang · Shanghai Jiao Tong University · Agentic System, Multimodal Large Language Model
Junfei Wang · Postdoctoral fellow @ RISE Lab, York University · Smart Grid, Convex Optimization, Cyber Security, Data-driven Optimization
Qishuo Hua · SJTU, SII, GAIR
Pengrui Lu · SJTU, SII, GAIR
Yang Xiao · SJTU, SII, GAIR
Pengfei Liu · SJTU, SII, GAIR