TeleMem: Building Long-Term and Multimodal Memory for Agentic AI

📅 2025-12-12

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of large language models in maintaining coherent and efficient long-term multimodal memory during extended interactions, primarily due to constraints inherent in attention mechanisms. To overcome this, the authors propose a unified long-term multimodal memory system that preserves critical dialogue context through narrative-driven dynamic extraction, employs a structured write pipeline for batched clustering and integration of memories, and incorporates ReAct-style reasoning to enable closed-loop understanding and action over multimodal inputs such as video. Evaluated on the ZH-4O long-term role-playing benchmark, the system achieves a 19% improvement in memory accuracy over the Mem0 baseline, reduces token consumption by 43%, and operates 2.1 times faster.
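The closed-loop observe, think, and act process mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `react_loop`, `think`, and `tools`, the `"answer"` stop action, and the step limit are hypothetical and are not taken from TeleMem. In a real system the `think` callable would presumably be backed by an LLM and the tools would query video memory.

```python
from typing import Callable, Optional

# Hypothetical types: a policy returns (action_name, argument).
Policy = Callable[[str, list], tuple]


def react_loop(question: str,
               think: Policy,
               tools: dict,
               max_steps: int = 5) -> Optional[str]:
    """Generic ReAct-style loop: think, act, observe, repeat."""
    observations: list = []
    for _ in range(max_steps):
        # Think: pick the next action from the question and what was seen so far.
        action, arg = think(question, observations)
        if action == "answer":
            return arg
        # Act, then observe: run the chosen tool and record its result.
        observations.append(tools[action](arg))
    return None  # step budget exhausted without an answer


# Scripted stand-ins for an LLM policy and a video-inspection tool.
def scripted_think(question: str, observations: list) -> tuple:
    if not observations:
        return ("look", "clip_3")
    return ("answer", observations[-1])


toy_tools = {"look": lambda clip: f"caption of {clip}"}
result = react_loop("What happens in the clip?", scripted_think, toy_tools)
```

The step limit is one simple way to guarantee termination when the policy never chooses to answer.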

📝 Abstract

Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal reasoning. To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.
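The batch, retrieve, cluster, and consolidate stages of the writing pipeline can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `MemoryStore`, `similarity`, `write_batch`, and the 0.5 threshold are hypothetical, word-set Jaccard overlap stands in for whatever similarity measure the system actually uses, every stored entry is treated as a retrieval candidate, and "keep the newest entry" stands in for a consolidation step that would more plausibly be performed by an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy in-memory store of free-text memory entries (hypothetical)."""
    entries: list = field(default_factory=list)


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def write_batch(store: MemoryStore, batch: list, threshold: float = 0.5) -> None:
    """Write a whole batch of new memories in one pass."""
    # Retrieve: here every stored entry is a candidate and seeds its own cluster.
    clusters = [[entry] for entry in store.entries]
    # Cluster: attach each new item to the first cluster whose seed is similar.
    for item in batch:
        for cluster in clusters:
            if similarity(item, cluster[0]) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    # Consolidate: keep the most recent member of each cluster (placeholder).
    store.entries = [cluster[-1] for cluster in clusters]


store = MemoryStore(entries=["user likes green tea"])
write_batch(store, ["user likes black tea", "user lives in Berlin"])
```

Processing the batch in one pass means each cluster is consolidated once, rather than once per incoming entry, which is the kind of saving a batched write path is meant to provide.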

Problem

Research questions and friction points this paper is trying to address.

long-term memory

multimodal reasoning

retrieval-augmented generation

memory updating

dialogue history

Innovation

Methods, ideas, or system contributions that make the work stand out.

TeleMem

long-term memory

multimodal reasoning