🤖 AI Summary
This work addresses a limitation of language agents in continual settings: they struggle to distill task experience into reusable knowledge, because existing memory systems couple online acquisition and consolidation into one process and therefore lack a global view across sessions. Inspired by complementary learning systems theory, the paper proposes Auto-Dreamer, a learned offline memory consolidator for language agents that decouples fast per-session memory acquisition from slow cross-session consolidation. Given a selected working region of a typed memory bank, the consolidator treats the region as read-only evidence, uses a bounded number of tool calls to inspect entries and their provenance-linked source trajectories, and synthesizes a compact replacement set that abstracts across sessions and supersedes the original region. The consolidator is trained with GRPO, using end-to-end agent performance as the reward. Trained on ScienceWorld trajectories alone, Auto-Dreamer outperforms fixed, RL-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12× smaller than the strongest baseline. Without retraining, it continues to lead on held-out ALFWorld and WebArena, using 6× less memory than the strongest baseline on ALFWorld.
📝 Abstract
Language agents increasingly operate over streams of related tasks, yet existing memory systems struggle to convert accumulated experience into reusable knowledge. Retrieval-augmented and structured memory methods record per-session observations effectively, but often couple acquisition and consolidation into a single online process, leaving the agent without a global view across sessions to discover recurring patterns, abstract shared procedures, or prune redundant entries. Inspired by complementary learning systems theory, we propose Auto-Dreamer, a learned offline consolidator for language-agent memory. Auto-Dreamer decouples fast per-session memory acquisition from slow cross-session consolidation. Given a selected working region of a typed memory bank, the consolidator treats the region as read-only evidence, performs bounded tool-use to inspect entries and provenance-linked source trajectories, and synthesizes a fresh compact replacement set that abstracts across sessions and supersedes the original region. We train Auto-Dreamer via GRPO, using end-to-end agent performance as the reward signal to learn how to consolidate memories acquired through fast online experience. Trained on ScienceWorld trajectories alone, Auto-Dreamer outperforms fixed, RL-trained, and prompted memory baselines on ScienceWorld by 7 points while using an active memory bank 12$\times$ smaller than the strongest baseline, and continues to lead on held-out ALFWorld and WebArena without retraining -- using 6$\times$ less memory than the strongest baseline on ALFWorld.
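To make the consolidation step described above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the names `MemoryEntry`, `ToolBudget`, `consolidate`, and `group_relative_advantages` are hypothetical, the `synthesize` callable stands in for the learned consolidator, and the only "tool" modeled is a lookup of a provenance-linked source trajectory. The last function shows the standard group-relative normalization of rewards that GRPO uses. How the paper computes rewards and samples groups is not specified in the abstract.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)  # frozen: region entries are read-only evidence
class MemoryEntry:
    entry_id: str
    kind: str                        # hypothetical type tag, e.g. "fact" or "procedure"
    text: str
    source_ids: Tuple[str, ...] = ()  # provenance links to source trajectories


@dataclass
class ToolBudget:
    remaining: int

    def spend(self) -> bool:
        """Consume one tool call; return False once the budget is exhausted."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def consolidate(
    bank: Dict[str, MemoryEntry],
    region_ids: Sequence[str],
    trajectories: Dict[str, str],
    synthesize: Callable[[Tuple[MemoryEntry, ...], List[str]], List[MemoryEntry]],
    max_tool_calls: int = 4,
) -> Dict[str, MemoryEntry]:
    """Replace a working region of the bank with a fresh, compact entry set."""
    region = tuple(bank[i] for i in region_ids)  # never mutated below
    budget = ToolBudget(max_tool_calls)

    # Bounded tool use: inspect at most `max_tool_calls` source trajectories.
    evidence: List[str] = []
    for entry in region:
        for sid in entry.source_ids:
            if budget.spend():
                evidence.append(trajectories[sid])

    # Assumption: `synthesize` stands in for the learned consolidator policy.
    replacement = synthesize(region, evidence)

    # The replacement set supersedes the region; the rest of the bank is kept.
    dropped = set(region_ids)
    new_bank = {k: v for k, v in bank.items() if k not in dropped}
    for entry in replacement:
        new_bank[entry.entry_id] = entry
    return new_bank


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """Normalize each rollout's reward against its sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

In this sketch, `consolidate` returns a new bank instead of editing the old one in place, which keeps the read-only treatment of the region explicit and makes it easy to score several candidate consolidations of the same region against each other.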