EgoMem: Lifelong Memory Agent for Full-duplex Omnimodal Models

2025-09-15

AI Summary
EgoMem addresses a limitation of existing text-based memory agents, which do not support the continuous multimodal perception and real-time interaction that embodied agents require. It introduces the first lifelong memory agent designed for full-duplex models that process real-time audio-visual streams. The system runs three asynchronous processes directly on raw audio-video streams: (1) a retrieval process that identifies users by face and voice and gathers relevant context from long-term memory; (2) an omnimodal dialog process that generates personalized audio responses from the retrieved context; and (3) a memory management process that detects dialog boundaries and extracts information to update the long-term memory. Because it works from raw audiovisual input rather than text, it can maintain long-term knowledge of users' facts, preferences, and social relationships. Experiments show over 95% accuracy for the retrieval and memory management modules on the test set; integrated with a fine-tuned RoboEgo omnimodal chatbot, the system achieves fact-consistency scores above 87% in real-time personalized dialogs, which the authors present as a strong baseline for future research.

Abstract
We introduce EgoMem, the first lifelong memory agent tailored for full-duplex models that process real-time omnimodal streams. EgoMem enables real-time models to recognize multiple users directly from raw audiovisual streams, to provide personalized responses, and to maintain long-term knowledge of users' facts, preferences, and social relationships extracted from audiovisual history. EgoMem operates with three asynchronous processes: (i) a retrieval process that dynamically identifies the user via face and voice and gathers relevant context from a long-term memory; (ii) an omnimodal dialog process that generates personalized audio responses based on the retrieved context; and (iii) a memory management process that automatically detects dialog boundaries from omnimodal streams and extracts the information needed to update the long-term memory. Unlike existing memory agents for LLMs, EgoMem relies entirely on raw audiovisual streams, making it especially suitable for lifelong, real-time, and embodied scenarios. Experimental results demonstrate that EgoMem's retrieval and memory management modules achieve over 95% accuracy on the test set. When integrated with a fine-tuned RoboEgo omnimodal chatbot, the system achieves fact-consistency scores above 87% in real-time personalized dialogs, establishing a strong baseline for future research.
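The three asynchronous processes described in the abstract can be pictured as cooperating tasks connected by queues. The following is only a minimal sketch under strong assumptions, not the authors' implementation: the names `Memory`, `retrieval`, `dialog`, `memory_manager`, and `run_session` are hypothetical, each input "frame" is assumed to already carry a user id and a transcribed utterance (standing in for face and voice identification), and every utterance is stored as a fact (standing in for dialog boundary detection and information extraction).

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Memory:
    # Hypothetical long-term store: user id -> list of remembered facts.
    facts: dict = field(default_factory=dict)


async def retrieval(frames, context_q, memory):
    # Stand-in for face/voice identification: the user id is given directly.
    for user_id, utterance in frames:
        context = list(memory.facts.get(user_id, []))
        await context_q.put((user_id, utterance, context))
    await context_q.put(None)  # sentinel: no more input


async def dialog(context_q, transcript_q, replies):
    # Stand-in for the omnimodal dialog model: builds a text reply
    # from the utterance and whatever context was retrieved.
    while (item := await context_q.get()) is not None:
        user_id, utterance, context = item
        replies.append(f"{user_id}: heard {utterance!r}, context={context}")
        await transcript_q.put((user_id, utterance))
    await transcript_q.put(None)


async def memory_manager(transcript_q, memory):
    # Stand-in for boundary detection and extraction: stores every utterance.
    while (item := await transcript_q.get()) is not None:
        user_id, utterance = item
        memory.facts.setdefault(user_id, []).append(utterance)


async def run_session(frames):
    memory = Memory()
    context_q, transcript_q = asyncio.Queue(), asyncio.Queue()
    replies = []
    # The three processes run concurrently and communicate only via queues.
    await asyncio.gather(
        retrieval(frames, context_q, memory),
        dialog(context_q, transcript_q, replies),
        memory_manager(transcript_q, memory),
    )
    return memory, replies
```

Calling `asyncio.run(run_session([("alice", "I like tea"), ("bob", "hello")]))` returns the updated memory and one reply per input. Decoupling the stages with queues is one way to keep the dialog path from waiting on memory updates, which is the property the asynchronous design appears to aim for.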
Problem

Research questions and friction points this paper is trying to address.

Develops lifelong memory agent for real-time omnimodal user recognition
Enables personalized responses through audiovisual context retrieval
Maintains long-term user knowledge from continuous multimodal streams
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time user recognition from audiovisual streams
Personalized responses via omnimodal dialog process
Automatic memory management with boundary detection
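As a rough illustration of real-time user recognition, one common approach is to compare an embedding of the incoming face or voice against enrolled embeddings and accept the best match only above a similarity threshold. The source does not say that EgoMem works this way; the functions `cosine` and `identify`, the two-dimensional toy embeddings, and the 0.8 threshold are all assumptions made for this sketch.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(query, enrolled, threshold=0.8):
    # Return the enrolled user whose embedding is most similar to the query,
    # or None when no similarity reaches the (assumed) threshold.
    best_id, best_score = None, threshold
    for user_id, embedding in enrolled.items():
        score = cosine(query, embedding)
        if score >= best_score:
            best_id, best_score = user_id, score
    return best_id


# Toy enrolled embeddings; real ones would come from face or voice encoders.
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
known = identify([0.9, 0.1], enrolled)     # close to alice's embedding
unknown = identify([1.0, 1.0], enrolled)   # equally far from both users
```

Returning `None` for low-similarity inputs gives the caller a place to handle unseen users, for example by enrolling a new profile.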