🤖 AI Summary
AI agents commonly suffer from catastrophic forgetting and deficient long-term memory, hindering multimodal interaction and sustained reasoning. To address this, we propose a model-agnostic, plug-and-play multimodal hierarchical memory framework that integrates parameterized short-term memory with retrieval-based, structured long-term memory, enabling scalable lifelong learning. Our approach introduces two key innovations: (1) a hierarchical knowledge graph representation for multimodal experience encoding and interpretable recall, and (2) a periodic knowledge distillation mechanism that enables adaptive forgetting, bounded memory growth, and continual memory consolidation. Together, these components support cross-modal experience encoding, hierarchical retrieval, and explainable recall. Experiments demonstrate significant improvements in agent consistency, adaptability, and reasoning performance across multimodal reasoning, continual learning, and long-horizon interactive tasks.
📝 Abstract
Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remember. Without reliable memory, agents catastrophically forget past experiences, struggle with long-horizon reasoning, and fail to operate coherently in multimodal or interactive environments. We introduce MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory, enabling scalable and adaptive multimodal intelligence. MemVerse maintains short-term memory for recent context while transforming raw multimodal experiences into structured long-term memories organized as hierarchical knowledge graphs. This design supports continual consolidation, adaptive forgetting, and bounded memory growth. To handle real-time demands, MemVerse introduces a periodic distillation mechanism that compresses essential knowledge from long-term memory into the parametric model, allowing fast, differentiable recall while preserving interpretability. Extensive experiments demonstrate that MemVerse significantly improves multimodal reasoning and continual learning efficiency, empowering agents to remember, adapt, and reason coherently across extended interactions.
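The abstract describes a two-tier design: a short-term buffer for recent context, a structured long-term store with consolidation and adaptive forgetting, and bounded memory growth. The paper's actual implementation is not shown here; the following is a minimal, hypothetical sketch of that short-term/long-term pattern, with all class and method names (`HierarchicalMemory`, `observe`, `recall`) invented for illustration and the knowledge graph reduced to a flat topic-keyed store.

```python
from collections import deque

class HierarchicalMemory:
    """Toy sketch of a two-tier memory: a short-term buffer plus a
    bounded long-term store. All names are hypothetical, not MemVerse's API."""

    def __init__(self, stm_capacity=4, ltm_capacity=8):
        self.stm = deque(maxlen=stm_capacity)  # short-term: recent context
        self.ltm = {}                          # long-term: topic -> node
        self.ltm_capacity = ltm_capacity       # bound on long-term growth

    def observe(self, topic, fact):
        # New experience enters short-term memory first; the oldest item
        # is consolidated into long-term memory before it falls out.
        if len(self.stm) == self.stm.maxlen:
            self._consolidate(*self.stm[0])
        self.stm.append((topic, fact))

    def _consolidate(self, topic, fact):
        # Consolidation: merge the experience into a long-term node.
        node = self.ltm.setdefault(topic, {"facts": [], "uses": 0})
        node["facts"].append(fact)
        if len(self.ltm) > self.ltm_capacity:
            self._forget()

    def _forget(self):
        # Adaptive forgetting: evict the least-retrieved topic to keep
        # long-term memory bounded.
        victim = min(self.ltm, key=lambda t: self.ltm[t]["uses"])
        del self.ltm[victim]

    def recall(self, topic):
        # Hierarchical retrieval: check recent context first, then the
        # long-term store; record the access for the forgetting policy.
        for t, fact in reversed(self.stm):
            if t == topic:
                return fact
        node = self.ltm.get(topic)
        if node:
            node["uses"] += 1
            return node["facts"][-1]
        return None
```

In the paper's framing, the long-term store would be a hierarchical knowledge graph and consolidation would be paired with periodic distillation back into the parametric model; this sketch only illustrates the buffer-then-consolidate data flow.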