🤖 AI Summary
This work addresses the challenge of personalized question answering in long-term human–AI interaction, where limited context windows hinder the continuous modeling of users’ evolving concepts, aliases, and preferences. To overcome this, the authors propose a dual-agent multimodal memory system comprising a ChatAgent and a MemoryManager that collaboratively maintain a two-tier hybrid memory architecture: a RawMessageStore preserving original dialogue logs and a SemanticMemoryStore capturing high-level semantic observations. This framework transforms personalization from a static configuration into a dynamic, co-evolving mechanism shaped by ongoing user interaction. Leveraging a concept-injection data synthesis pipeline based on Yo’LLaVA and MC-LLaVA, the approach significantly outperforms baseline models in long-horizon multimodal dialogue tasks, yielding markedly improved quality in personalized responses.
📝 Abstract
This work addresses the challenge of personalized question answering in long-term human–machine interaction: when conversational history spans weeks or months and exceeds the context window, existing personalization mechanisms struggle to continuously absorb and leverage users' incremental concepts, aliases, and preferences. Current personalized multimodal models are predominantly static: concepts are fixed at initialization and cannot evolve during interaction. We propose M2A, an agentic dual-layer hybrid memory system that maintains personalized multimodal information through online updates. The system employs two collaborative agents: the ChatAgent manages user interactions and autonomously decides when to query or update memory, while the MemoryManager decomposes memory requests from the ChatAgent into fine-grained operations on the dual-layer memory bank, which couples a RawMessageStore (an immutable conversation log) with a SemanticMemoryStore (high-level observations), providing memories at different granularities. In addition, we develop a reusable data synthesis pipeline that injects concept-grounded sessions from Yo'LLaVA and MC-LLaVA into LoCoMo long conversations while preserving temporal coherence. Experiments show that M2A significantly outperforms baselines, demonstrating that transforming personalization from one-shot configuration into a co-evolving memory mechanism offers a viable path to high-quality individualized responses in long-term multimodal interactions. The code is available at https://github.com/Little-Fridge/M2A.
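The dual-layer design described in the abstract can be sketched minimally as follows. This is an illustrative toy, not the M2A implementation: every class and method name here is hypothetical, and simple keyword matching stands in for whatever retrieval and LLM-driven observation extraction the actual system uses.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RawMessage:
    """One original dialogue turn; frozen=True keeps entries immutable."""
    turn: int
    speaker: str
    text: str


class RawMessageStore:
    """Append-only log of original conversation turns (the lower layer)."""

    def __init__(self) -> None:
        self._log: List[RawMessage] = []

    def append(self, msg: RawMessage) -> None:
        self._log.append(msg)

    def window(self, start: int, end: int) -> List[RawMessage]:
        # Retrieve a slice of raw history, e.g. to re-read exact wording.
        return self._log[start:end]

    def __len__(self) -> int:
        return len(self._log)


class SemanticMemoryStore:
    """High-level observations distilled from raw turns (the upper layer).

    Each entry pairs an observation with the turn indices it came from,
    so answers can be traced back to the raw log.
    """

    def __init__(self) -> None:
        self._observations: List[Tuple[str, List[int]]] = []

    def write(self, observation: str, source_turns: List[int]) -> None:
        self._observations.append((observation, source_turns))

    def query(self, keyword: str) -> List[str]:
        # Placeholder retrieval: substring match instead of a real retriever.
        return [obs for obs, _ in self._observations
                if keyword.lower() in obs.lower()]


class MemoryManager:
    """Decomposes a memory request into operations on the two stores."""

    def __init__(self) -> None:
        self.raw = RawMessageStore()
        self.semantic = SemanticMemoryStore()

    def handle_update(self, msg: RawMessage,
                      observation: Optional[str] = None) -> None:
        # Every turn is logged verbatim; an observation is written only
        # when the (hypothetical) ChatAgent decides one is worth keeping.
        self.raw.append(msg)
        if observation is not None:
            self.semantic.write(observation, [msg.turn])

    def handle_query(self, keyword: str) -> List[str]:
        return self.semantic.query(keyword)


if __name__ == "__main__":
    mm = MemoryManager()
    mm.handle_update(
        RawMessage(turn=0, speaker="user",
                   text="By the way, my cat's nickname is Mochi."),
        observation="User's cat is nicknamed 'Mochi'.",
    )
    mm.handle_update(RawMessage(turn=1, speaker="assistant", text="Noted!"))
    print(mm.handle_query("Mochi"))  # observation recalled later
```

The point of the split is that the raw log preserves exact, replayable context while the semantic layer stays small enough to query cheaply as the conversation grows past any context window.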