Task-Adaptive Retrieval over Agentic Multi-Modal Web Histories via Learned Graph Memory

📅 2026-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of retrieving task-relevant observations from long multimodal web interaction histories, where relevance shifts with the evolving task state, the modality of each observation, and its temporal distance. The authors propose ACGM, a learned graph-memory retriever optimized via policy gradient on downstream task success, which builds a task-adaptive relevance graph over the agent's history. The summary describes ACGM as the first method to enable task-adaptive retrieval over multimodal interaction histories, and its learned decay rates indicate that visual information decays about 4.3× faster than textual content. It combines modality-specific temporal decay with a sparse graph structure, enabling efficient approximate nearest neighbor retrieval in O(log T) time. Evaluated on WebShop, VisualWebArena, and Mind2Web, ACGM outperforms 19 strong baselines, achieving 82.7 nDCG@10 and 89.2% Precision@10.
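For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how nDCG@10 and Precision@10 are commonly computed for a single ranked list. It assumes linear gain and a log2 position discount; the paper's exact gain function and relevance labels are not given in this summary, and the function names here are illustrative only.

```python
import math

def precision_at_k(ranked_rels, k=10):
    """Fraction of the top-k positions holding a relevant item (relevance > 0)."""
    return sum(1 for r in ranked_rels[:k] if r > 0) / k

def dcg_at_k(ranked_rels, k=10):
    """Discounted cumulative gain with linear gain (an assumption) and log2 discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ranked_rels[:k]))

def ndcg_at_k(ranked_rels, k=10, all_rels=None):
    """DCG normalized by the DCG of the ideal ordering.

    all_rels: relevance labels of every judged candidate; if omitted, the
    ideal ordering is taken over the ranked list itself (a simplification).
    """
    pool = ranked_rels if all_rels is None else all_rels
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0

if __name__ == "__main__":
    # Hypothetical binary relevance labels for ten retrieved observations.
    ranked = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    print(precision_at_k(ranked), ndcg_at_k(ranked))
```

Scores such as those reported above would then be averages of these per-query values over an evaluation set, scaled to a 0-100 range.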

📝 Abstract

Retrieving relevant observations from long multi-modal web interaction histories is challenging because relevance depends on the evolving task state, modality (screenshots, HTML text, structured signals), and temporal distance. Prior approaches typically rely on static similarity thresholds or fixed-capacity buffers, which fail to adapt relevance to the current task context. We propose ACGM, a learned graph-memory retriever that constructs task-adaptive relevance graphs over agent histories using policy-gradient optimization from downstream task success. ACGM captures heterogeneous temporal dynamics with modality-specific decay (visual decays $4.3\times$ faster than text: $\lambda_v = 0.47$ vs. $\lambda_x = 0.11$) and learns sparse connectivity (3.2 edges/node), enabling efficient $O(\log T)$ retrieval. Across WebShop, VisualWebArena, and Mind2Web, ACGM improves retrieval quality to 82.7 nDCG@10 (+9.3 over GPT-4o, $p < 0.001$) and 89.2% Precision@10 (+7.7), outperforming 19 strong dense, re-ranking, multi-modal, and graph-based baselines. Code to reproduce our results is available at https://github.com/S-Forouzandeh/ACGM-Agentic-Web (Saman Forouzandeh).
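To make the modality-specific decay concrete, here is a minimal sketch that applies the two reported rates to a relevance score. The exponential form, the unit of "age" (interaction steps), and the function names are assumptions made for illustration; the abstract gives only the rates $\lambda_v$ and $\lambda_x$, not the functional form ACGM uses.

```python
import math

# Decay rates reported in the abstract; all other choices below are assumptions.
DECAY_RATE = {"visual": 0.47, "text": 0.11}

def decayed_score(similarity, modality, age):
    """Down-weight a similarity score by exp(-lambda * age).

    age: steps elapsed since the observation was recorded (assumed unit).
    """
    return similarity * math.exp(-DECAY_RATE[modality] * age)

def half_life(modality):
    """Age at which an observation's weight halves under exponential decay."""
    return math.log(2) / DECAY_RATE[modality]

if __name__ == "__main__":
    for modality in DECAY_RATE:
        print(modality, round(half_life(modality), 2),
              round(decayed_score(0.9, modality, age=5), 3))
```

Under this assumed form, the ratio of the two half-lives equals $\lambda_v / \lambda_x \approx 4.3$, which matches the "4.3× faster" figure: a screenshot loses half its weight in roughly 1.5 steps, while text takes roughly 6.3.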

Problem

Research questions and friction points this paper is trying to address.

task-adaptive retrieval

multi-modal web histories

relevance modeling

temporal dynamics

agent memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

task-adaptive retrieval

learned graph memory

multi-modal web histories