EgoMem: Lifelong Memory Agent for Full-duplex Omnimodal Models

📅 2025-09-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
EgoMem addresses the limitation of existing text-based memory agents, which fail to support embodied agents' continuous multimodal perception and real-time interaction. It introduces the first lifelong memory agent framework designed specifically for full-duplex, real-time audio-visual streaming. Methodologically, it employs an asynchronous three-process architecture operating directly on raw audio-video streams: (1) simultaneous speaker identification via voiceprint and face recognition; (2) multimodal dialogue generation; and (3) unsupervised dialogue boundary detection for incremental memory updates. Its end-to-end audio-visual-native memory mechanism enables asynchronous retrieval, personalized response generation, and long-term knowledge evolution. Experiments demonstrate >95% accuracy in user retrieval and memory management; integrated with RoboEgo, it achieves 87.3% factual consistency in real-time personalized dialogue. This work establishes a novel paradigm for lifelong learning and socially grounded interaction in multimodal embodied intelligence.
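The identification step (1) combines voiceprint and face recognition to decide who is speaking. The paper does not publish its matching procedure; a minimal sketch of one common approach, matching incoming face and voice embeddings against enrolled users by cosine similarity with a simple average fusion, might look like the following. All names and the 0.8 threshold are illustrative assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify_user(face_emb, voice_emb, enrolled, threshold=0.8):
    """Return the best-matching enrolled user id, or None if no match.

    `enrolled` maps user_id -> (face_embedding, voice_embedding).
    Averaging the two modality similarities is a toy fusion rule,
    not the paper's method.
    """
    best_id, best_score = None, threshold
    for user_id, (f, v) in enrolled.items():
        score = 0.5 * (cosine(face_emb, f) + cosine(voice_emb, v))
        if score > best_score:
            best_id, best_score = user_id, score
    return best_id
```

A real system would use learned speaker and face embedding models; the thresholded nearest-match logic stays the same.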


πŸ“ Abstract
We introduce EgoMem, the first lifelong memory agent tailored for full-duplex models that process real-time omnimodal streams. EgoMem enables real-time models to recognize multiple users directly from raw audiovisual streams, to provide personalized responses, and to maintain long-term knowledge of users' facts, preferences, and social relationships extracted from audiovisual history. EgoMem operates with three asynchronous processes: (i) a retrieval process that dynamically identifies users via face and voice, and gathers relevant context from long-term memory; (ii) an omnimodal dialog process that generates personalized audio responses based on the retrieved context; and (iii) a memory management process that automatically detects dialog boundaries from omnimodal streams, and extracts necessary information to update the long-term memory. Unlike existing memory agents for LLMs, EgoMem relies entirely on raw audiovisual streams, making it especially suitable for lifelong, real-time, and embodied scenarios. Experimental results demonstrate that EgoMem's retrieval and memory management modules achieve over 95% accuracy on the test set. When integrated with a fine-tuned RoboEgo omnimodal chatbot, the system achieves fact-consistency scores above 87% in real-time personalized dialogs, establishing a strong baseline for future research.
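The three asynchronous processes can be pictured as a queue-connected pipeline: retrieval identifies the user and attaches stored facts, the dialog process responds with that context, and the memory process persists new information after each turn. A minimal single-machine sketch, with mocked identification and response generation (all class and function names are assumptions, not the paper's API):

```python
import queue
import threading

class LongTermMemory:
    """Toy user-keyed fact store standing in for the audiovisual memory bank."""
    def __init__(self):
        self._facts = {}
        self._lock = threading.Lock()  # processes may run concurrently

    def lookup(self, user_id):
        with self._lock:
            return list(self._facts.get(user_id, []))

    def update(self, user_id, fact):
        with self._lock:
            self._facts.setdefault(user_id, []).append(fact)

def retrieval_process(stream_q, context_q, memory):
    # (i) identify the user from each (mocked) audiovisual chunk,
    # then fetch their stored facts as dialog context.
    for chunk in iter(stream_q.get, None):  # None is the shutdown sentinel
        user_id = chunk["face_or_voice_id"]  # stands in for face/voiceprint matching
        context_q.put((user_id, chunk["utterance"], memory.lookup(user_id)))
    context_q.put(None)

def dialog_process(context_q, response_q, boundary_q):
    # (ii) generate a personalized response conditioned on retrieved context.
    for user_id, utterance, facts in iter(context_q.get, None):
        reply = f"[{user_id}] heard '{utterance}'; known facts: {facts}"
        response_q.put(reply)
        boundary_q.put((user_id, utterance))  # hand the finished turn to memory
    boundary_q.put(None)

def memory_process(boundary_q, memory):
    # (iii) on each detected dialog boundary, extract and persist information.
    for user_id, utterance in iter(boundary_q.get, None):
        memory.update(user_id, utterance)
```

In the paper the three processes run asynchronously over continuous audio-video streams; here each stage can be driven by its own thread (or called sequentially for testing), with the queues playing the role of the asynchronous hand-offs.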
Problem

Research questions and friction points this paper is trying to address.

Develops lifelong memory agent for real-time omnimodal user recognition
Enables personalized responses through audiovisual context retrieval
Maintains long-term user knowledge from continuous multimodal streams
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time user recognition from audiovisual streams
Personalized responses via omnimodal dialog process
Automatic memory management with boundary detection
Authors

Yiqun Yao (unknown affiliation)
Naitong Yu (Beijing Academy of Artificial Intelligence)
Xiang Li (Beijing Academy of Artificial Intelligence, Beijing, China)
Xin Jiang (Beijing Academy of Artificial Intelligence, Beijing, China)
Xuezhi Fang (Beijing Academy of Artificial Intelligence, Beijing, China)
Wenjia Ma (Spin Matrix, China)
Xuying Meng (Institute of Computing Technology, Chinese Academy of Sciences)
Jing Li (Harbin Institute of Technology, Shenzhen, China)
Aixin Sun (Nanyang Technological University, Singapore)
Yequan Wang (Beijing Academy of Artificial Intelligence, Beijing, China)