🤖 AI Summary
LLM agents struggle to adapt to unfamiliar domains, and improving them usually requires costly online interaction or fine-tuning. This paper proposes JEF Hinter, a lightweight method that augments an agent's prompt with hints distilled from offline trajectories. It makes three contributions: (1) trajectory distillation with a zooming mechanism, coupled with state-based retrieval, to extract the decisive cues from long, noisy raw trajectories; (2) context-aware hint synthesis from both successful and failed trajectories, which yields useful guidance even when only failure data is available; and (3) parallelizable hint generation with benchmark-independent prompting. On MiniWoB++, WorkArena-L1, and WebArena-Lite, the method consistently outperforms strong baselines, including human-written and document-based hints.
📝 Abstract
Large language model (LLM) agents perform well in sequential decision-making tasks, but improving them on unfamiliar domains often requires costly online interactions or fine-tuning on large expert datasets. These strategies are impractical for closed-source models and expensive for open-source ones, with risks of catastrophic forgetting. Offline trajectories offer reusable knowledge, yet demonstration-based methods struggle because raw traces are long, noisy, and tied to specific tasks. We present Just-in-time Episodic Feedback Hinter (JEF Hinter), an agentic system that distills offline traces into compact, context-aware hints. A zooming mechanism highlights decisive steps in long trajectories, capturing both strategies and pitfalls. Unlike prior methods, JEF Hinter leverages both successful and failed trajectories, extracting guidance even when only failure data is available, while supporting parallelized hint generation and benchmark-independent prompting. At inference, a retriever selects relevant hints for the current state, providing targeted guidance with transparency and traceability. Experiments on MiniWoB++, WorkArena-L1, and WebArena-Lite show that JEF Hinter consistently outperforms strong baselines, including human- and document-based hints.
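The inference-time step described in the abstract (a retriever selects relevant hints for the current state, and each hint stays traceable to its source) can be illustrated with a minimal sketch. The abstract does not specify how JEF Hinter's retriever works, so everything here is an assumption made for illustration: the `Hint` fields, the `retrieve_hints` and `build_prompt` helpers, and the token-overlap (Jaccard) scoring are all hypothetical stand-ins, not the authors' implementation.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Hint:
    """Hypothetical record for one distilled hint (not the paper's schema)."""
    text: str           # compact guidance distilled from an offline trace
    source_trace: str   # id of the originating trace, kept for traceability
    from_failure: bool  # True if the hint was distilled from a failed run


def tokens(s: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_hints(state: str, hints: list[Hint], k: int = 2) -> list[Hint]:
    """Return up to k hints whose text overlaps most with the current state.

    Assumption: simple token overlap stands in for whatever retriever the
    paper actually uses. Hints with zero overlap are dropped.
    """
    state_tokens = tokens(state)
    scored = [(jaccard(state_tokens, tokens(h.text)), h) for h in hints]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [h for score, h in ranked[:k] if score > 0]


def build_prompt(state: str, hints: list[Hint]) -> str:
    """Prepend retrieved hints, each tagged with its source trace id."""
    lines = [f"- {h.text} (source: {h.source_trace})" for h in hints]
    return "Hints:\n" + "\n".join(lines) + f"\n\nCurrent state:\n{state}"


if __name__ == "__main__":
    bank = [
        Hint("On a login form, fill the username field before clicking submit.",
             "trace-001", from_failure=False),
        Hint("Do not click the submit button while the date picker is still open.",
             "trace-002", from_failure=True),
        Hint("Use the search box to filter the table by order number.",
             "trace-003", from_failure=False),
    ]
    state = "Login form with username field, password field, and submit button"
    print(build_prompt(state, retrieve_hints(state, bank, k=2)))
```

In this sketch the hint from the failed trajectory (`trace-002`) is retrieved alongside the one from a successful trajectory, mirroring the abstract's point that failures also carry usable guidance. A real system would likely replace the token-overlap score with a learned or embedding-based retriever.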