Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

📅 2025-08-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of multimodal agents: the lack of long-term memory. We propose M3-Agent, a real-time multimodal agent with entity-centric long-term memory. Its core innovation is a unified cross-modal memory representation that anchors visual and auditory inputs to semantic entities, enabling memory-grounded, iterative reasoning and decision-making. Methodologically, we design a memory-augmented architecture in which the agent continuously writes, retrieves, and updates memory while processing video and audio streams, and we employ reinforcement learning to optimize the reasoning policy. To enable systematic evaluation, we introduce M3-Bench, a long-video question-answering benchmark that includes videos recorded from a robot's perspective. Experiments show that M3-Agent outperforms the strongest baseline, a prompting agent built on Gemini-1.5-Pro and GPT-4o, by 6.7%, 7.7%, and 5.3% accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively, significantly improving long-horizon perceptual consistency and cross-modal reasoning.
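As a concrete illustration of what an entity-centric, multimodal memory might look like, here is a minimal Python sketch. All names (`EntityMemory`, `MemoryEntry`) and the toy cosine retrieval are assumptions made for illustration, not the paper's actual data structures:

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class MemoryEntry:
    """One memory item anchored to a semantic entity."""
    entity: str       # e.g. a tracked person ID such as "person_01"
    kind: str         # "episodic" (observed event) or "semantic" (distilled fact)
    modality: str     # "visual" or "auditory"
    text: str         # natural-language content of the memory
    embedding: list   # vector used for similarity search

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a)) or 1.0
    nb = sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

class EntityMemory:
    """Entity-centric store: writes anchor clip-level content to entities;
    retrieval optionally filters by entity, then ranks by similarity."""
    def __init__(self):
        self.entries = []

    def write(self, entry):
        self.entries.append(entry)

    def retrieve(self, query_emb, entity=None, k=3):
        pool = [e for e in self.entries if entity is None or e.entity == entity]
        return sorted(pool, key=lambda e: cosine(query_emb, e.embedding),
                      reverse=True)[:k]

# Toy usage: one episodic and one semantic memory for the same entity.
mem = EntityMemory()
mem.write(MemoryEntry("person_01", "episodic", "visual",
                      "person_01 picked up a red mug", [0.9, 0.1, 0.0]))
mem.write(MemoryEntry("person_01", "semantic", "auditory",
                      "person_01 introduced herself as Alice", [0.2, 0.8, 0.1]))
print(mem.retrieve([1.0, 0.0, 0.0], entity="person_01", k=1)[0].text)
```

A real system would replace the toy cosine search with a learned retriever, but the entity filter is the point of the design: memories about the same person or object stay linked across modalities and across time.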

📝 Abstract
We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 929 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-pro and GPT-4o, achieving 6.7%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design. Model, code, and data are available at https://github.com/bytedance-seed/m3-agent.
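The control behavior described in the abstract (multi-turn, iterative reasoning over retrieved memory) can be sketched as a simple loop. This is a toy illustration, not the paper's implementation: `reason` stands in for the multimodal LLM policy, `embed` for a real encoder, and it reuses the hypothetical `EntityMemory` from the sketch above:

```python
def control_loop(instruction, memory, reason, embed, max_turns=5):
    """Iterative reasoning sketch: each turn, the policy either issues a
    memory query or commits to a final answer (capped at max_turns)."""
    context = []  # accumulated (query, retrieved snippets) pairs
    for _ in range(max_turns):
        action, payload = reason(instruction, context)
        if action == "answer":
            return payload
        hits = memory.retrieve(embed(payload), k=3)
        context.append((payload, [h.text for h in hits]))
    # Out of turns: force the policy to answer from what was gathered.
    return reason(instruction, context, force_answer=True)[1]

def reason(instruction, context, force_answer=False):
    # Stand-in policy: search once, then answer from retrieved snippets.
    if context or force_answer:
        facts = "; ".join(s for _, snippets in context for s in snippets)
        return "answer", f"Based on memory: {facts or 'no evidence found'}"
    return "search", instruction

def embed(text):
    # Toy embedding so the example runs end to end; not meaningful.
    return [float(len(text) % 5), 1.0, 0.0]

# Reusing `mem` from the EntityMemory sketch above:
print(control_loop("What did person_01 pick up?", mem, reason, embed))
```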
Problem

Research questions and friction points this paper is trying to address.

Developing a multimodal agent with long-term memory capabilities
Enabling real-time processing of visual and auditory inputs
Creating a benchmark for evaluating memory-based reasoning (a scoring sketch follows this list)
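The benchmark piece reduces to scoring free-form answers against annotated question-answer pairs. A minimal sketch of that evaluation loop, assuming a `judge` function for correctness (the paper does not prescribe this exact interface):

```python
def evaluate(agent, qa_pairs, judge):
    """Accuracy over a long-video QA benchmark: `agent` answers each
    question from its memory of the video; `judge` decides correctness
    (e.g., exact match or an LLM-based grader)."""
    correct = sum(judge(agent(q), gold) for q, gold in qa_pairs)
    return correct / len(qa_pairs)

# Toy usage with stand-in components:
qa = [("What did person_01 pick up?", "a red mug")]
agent = lambda q: "a red mug"
judge = lambda pred, gold: int(gold.lower() in pred.lower())
print(f"accuracy = {evaluate(agent, qa, judge):.1%}")
```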
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal agent with long-term memory
Entity-centric memory organization
Reinforcement learning framework for training the reasoning policy (a toy sketch follows this list)
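The reinforcement-learning piece can be illustrated with a toy REINFORCE update. This is not the paper's training recipe; it only shows the general idea of rewarding answer accuracy to shape when the policy stops retrieving and answers. The single-logit policy and the 70% two-hop question mix are invented for the example:

```python
import math
import random

theta = 0.0  # logit of P(answer now, without another retrieval)

def p_answer(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def rollout(theta, needs_two_hops):
    """One episode: the agent answers immediately or retrieves once more.
    Answering too early on a two-hop question yields zero reward."""
    answered_early = random.random() < p_answer(theta)
    reward = 0.0 if (answered_early and needs_two_hops) else 1.0
    # Gradient of log-prob of the sampled action w.r.t. theta.
    grad = (1 - p_answer(theta)) if answered_early else -p_answer(theta)
    return reward, grad

lr, baseline = 0.5, 0.5
for _ in range(200):
    # Most questions in this toy distribution require a second retrieval.
    reward, grad = rollout(theta, needs_two_hops=random.random() < 0.7)
    theta += lr * (reward - baseline) * grad  # REINFORCE with a baseline

print(f"P(answer immediately) after training: {p_answer(theta):.2f}")
```

Because early answers are usually punished under this reward, the learned probability of answering immediately drops, i.e., the policy learns to keep querying memory before committing, which is the qualitative behavior reinforcement learning is meant to induce here.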