Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing memory-augmented agents rely on unimodal trajectory storage, which suffers from brevity bias and insufficient multimodal representation, hindering synergistic learning of visual attention and logical reasoning. To address this, we propose ViLoMem, the first dual-stream memory framework that explicitly decouples visual distraction patterns from logical reasoning errors, enabling incremental construction of multimodal semantic memory. ViLoMem integrates a grow-and-refine mechanism, error-aware memory updating, trajectory replay, and abstract knowledge distillation to ensure stable generalization and mitigate catastrophic forgetting. Evaluated on six mainstream multimodal benchmarks, ViLoMem achieves significant improvements in pass@1 accuracy, effectively suppressing repetitive visual distractions and logical hallucinations. Empirical results demonstrate its long-term efficacy in continual learning and cross-domain reasoning.

📝 Abstract
MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo -- solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge -- preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction--hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.
Problem

Research questions and friction points this paper is trying to address.

MLLMs lack multimodal memory for visual and logical reasoning patterns
Existing memory systems lose essential knowledge and record only single-modality traces
Current approaches fail to preserve how visual attention and logical reasoning jointly contribute to solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream memory encodes visual distractions and logical errors separately
Grow-and-refine principle accumulates multimodal semantic knowledge incrementally
Schema-based memory preserves stable strategies while avoiding catastrophic forgetting
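The dual-stream, grow-and-refine idea from the bullets above can be sketched as a toy data structure. This is an illustrative assumption, not the paper's implementation: the class name, the keyed-schema stores, and the overwrite-on-refine rule are all hypothetical stand-ins for ViLoMem's actual memory update mechanism.

```python
from dataclasses import dataclass, field

@dataclass
class DualStreamMemory:
    """Hypothetical sketch: separate stores for visual distraction
    patterns and logical reasoning errors (the paper's two streams)."""
    visual: dict = field(default_factory=dict)   # schema key -> (note, refinements)
    logical: dict = field(default_factory=dict)

    def grow(self, stream: str, key: str, note: str) -> None:
        store = self.visual if stream == "visual" else self.logical
        if key in store:
            # Refine: update the existing schema in place instead of
            # appending a near-duplicate entry (avoids unbounded growth).
            _, hits = store[key]
            store[key] = (note, hits + 1)
        else:
            # Grow: admit a genuinely new error schema.
            store[key] = (note, 1)

    def recall(self, stream: str):
        store = self.visual if stream == "visual" else self.logical
        return list(store.items())

mem = DualStreamMemory()
mem.grow("visual", "chart-axis", "misread a log-scale axis as linear")
mem.grow("logical", "unit-mix", "added quantities with mismatched units")
mem.grow("visual", "chart-axis", "check axis scale before reading values")
print(len(mem.recall("visual")))  # → 1 (refined in place, not duplicated)
```

The key design point the bullets describe is that the two error streams never mix: a visual distraction pattern can be refined without disturbing any logical-error schema, which is one way "explicit distraction–hallucination separation" could be realized.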
Authors
Weihao Bo (Nanjing University of Science and Technology; Baidu Inc)
Shan Zhang (Adelaide AIML)
Yanpeng Sun (Nanjing University of Science and Technology; Computer Vision, Deep Learning, Multimedia)
Jingjing Wu (Baidu Inc)
Qunyi Xie (Baidu VIS; OCR, MLLM)
Xiao Tan (Baidu Inc)
Kunbin Chen (Baidu Inc)
Wei He (Baidu Inc)
Xiaofan Li (East China Normal University; Computer Vision)
Na Zhao (Singapore University of Technology and Design)
Jingdong Wang (Baidu Inc)
Zechao Li (Nanjing University of Science and Technology)