M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval

📅 2026-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of multimodal large language models in long-horizon web tasks, where high computational overhead and weak reasoning capabilities often hinder performance. The authors propose a training-free, dual-memory augmentation framework that introduces, for the first time, an internal–external memory mechanism: internal memory dynamically compresses interaction history through trajectory summarization, while external memory retrieves actionable insights from an offline knowledge base to inform decision-making. This approach substantially improves both task efficiency and success rates, achieving up to a 19.6% absolute gain in success rate on WebVoyager and OnlineMind2Web benchmarks. Furthermore, it reduces token consumption by 58.7% on Qwen3-VL-32B and yields up to a 12.5% accuracy improvement even on closed-source models such as Claude.

📝 Abstract
Multimodal Large Language Model (MLLM)-based agents have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies often rely heavily on extensive data collection and model training, yet still struggle with high computational costs and insufficient reasoning capabilities in complex, long-horizon scenarios. To address this, we propose M$^2$, a training-free, memory-augmented framework designed to optimize context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that synergizes Dynamic Trajectory Summarization (Internal Memory), which compresses verbose interaction history into concise state updates, with Insight Retrieval Augmentation (External Memory), which guides the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations on WebVoyager and OnlineMind2Web demonstrate that M$^2$ consistently surpasses baselines, yielding up to a 19.6% success rate increase and a 58.7% token reduction for Qwen3-VL-32B, while proprietary models such as Claude achieve accuracy gains of up to 12.5% alongside significantly lower computational overhead.
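The dual-tier mechanism the abstract describes can be sketched as an agent that keeps a rolling compressed summary (internal memory) next to a small window of raw steps, and consults an offline insight bank (external memory) when building the model's context. This is an illustrative sketch, not the paper's implementation: all class and method names, the keyword-match retrieval, and the string-concatenation stand-in for LLM summarization are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DualMemoryAgent:
    """Hypothetical web agent with internal (summary) and external (insight) memory."""
    insight_bank: dict                               # external memory: keyword -> guideline (offline, read-only)
    state_summary: str = ""                          # internal memory: rolling compressed history
    raw_window: list = field(default_factory=list)   # most recent raw steps kept verbatim
    window_size: int = 3                             # raw steps retained before compression

    def record_step(self, observation: str, action: str) -> None:
        """Append a step; fold overflowing old steps into the summary.

        In the paper this compression would be LLM-driven trajectory
        summarization; here a plain string fold stands in for it.
        """
        self.raw_window.append(f"obs={observation} act={action}")
        while len(self.raw_window) > self.window_size:
            oldest = self.raw_window.pop(0)
            self.state_summary = f"{self.state_summary} | {oldest}".strip(" |")

    def retrieve_insights(self, task: str) -> list:
        """External memory lookup: guidelines whose keyword appears in the task."""
        return [tip for kw, tip in self.insight_bank.items() if kw in task.lower()]

    def build_prompt(self, task: str) -> str:
        """Context handed to the MLLM: task + compressed history + retrieved insights."""
        return "\n".join([
            f"TASK: {task}",
            f"SUMMARY: {self.state_summary or '(start)'}",
            "RECENT: " + "; ".join(self.raw_window),
            "INSIGHTS: " + "; ".join(self.retrieve_insights(task)),
        ])
```

Because only the short summary and the last few raw steps enter the prompt, context length stays roughly constant over a long trajectory, which is where the reported token reduction would come from.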
Problem

Research questions and friction points this paper is trying to address.

long-horizon web navigation
multimodal large language models
memory augmentation
trajectory summarization
insight retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Memory Augmentation
Trajectory Summarization
Insight Retrieval
Training-Free Framework
Long-Horizon Web Navigation