🤖 AI Summary
To address the weak long-term memory and poor interpretability of intelligent personal assistants in real-world scenarios, this paper proposes the first embodied memory system framework. It integrates vision-language models (VLMs) with large language models (LLMs) for multimodal perception and structured information extraction, and constructs a unified memory representation that combines a knowledge graph with vector embeddings to support retrieval-augmented question answering driven by both semantic search and graph querying. Its key innovation is the first integration of VLM-based understanding, graph-based knowledge modeling, and vector memory within a closed-loop embodied memory architecture, achieving seamless coupling of perception, memory, and reasoning. Experiments on real-world cases show significant improvements in temporal event-memory consistency, relational traceability, and complex question-answering accuracy. The framework provides verifiable, interpretable long-term memory support for high-reliability cognitive-assistance applications.
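The closed perception–memory–reasoning loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the VLM and LLM calls are stubbed out, and all function names (`vlm_caption`, `llm_extract_triples`, `perceive`) are hypothetical.

```python
# Illustrative sketch of the closed-loop architecture: an image is captioned
# by a VLM, an LLM extracts structured triples from the caption, and the
# triples are written into graph-structured memory. Model calls are stubbed.

def vlm_caption(image):
    # stand-in for a vision-language model captioning call
    return "Alice places her keys in the kitchen drawer"

def llm_extract_triples(caption):
    # stand-in for LLM-based structured information extraction
    return [("Alice", "places", "keys"),
            ("keys", "located_in", "kitchen drawer")]

def perceive(image, memory):
    caption = vlm_caption(image)                  # 1. multimodal perception
    for triple in llm_extract_triples(caption):   # 2. structured extraction
        memory.append(triple)                     # 3. write to graph memory
    return memory

memory = perceive(image=None, memory=[])
print(memory)
```

Downstream question answering would then retrieve from `memory` rather than re-processing raw sensor data, which is what makes the stored events traceable and auditable.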
📝 Abstract
A wide variety of agentic AI applications, ranging from cognitive assistants for dementia patients to robotics, demand a robust memory system grounded in reality. In this paper, we propose such a memory system consisting of three components. First, we combine Vision Language Models for image captioning and entity disambiguation with Large Language Models for consistent information extraction during perception. Second, the extracted information is represented in a memory consisting of a knowledge graph enhanced by vector embeddings to efficiently manage relational information. Third, we combine semantic search and graph query generation for question answering via Retrieval Augmented Generation. We illustrate the system's operation and potential using a real-world example.
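The second and third components, a graph memory with a vector index and hybrid retrieval over both, can be sketched as follows. This is a toy in-memory version under stated assumptions: the class name `MemoryStore`, the two-dimensional embeddings, and the one-hop expansion strategy are all illustrative, not the paper's design.

```python
# Hypothetical sketch of hybrid retrieval: semantic (vector) search picks the
# most relevant entity, then a pattern query over knowledge-graph triples
# expands its neighbourhood as grounded context for an LLM answer.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class MemoryStore:
    def __init__(self):
        self.triples = []        # knowledge graph: (subject, relation, object)
        self.embeddings = {}     # vector index: entity -> embedding

    def add(self, subj, rel, obj, subj_vec):
        self.triples.append((subj, rel, obj))
        self.embeddings[subj] = subj_vec

    def graph_query(self, subj=None, rel=None, obj=None):
        # wildcard pattern match over triples (None matches anything)
        return [t for t in self.triples
                if (subj is None or t[0] == subj)
                and (rel is None or t[1] == rel)
                and (obj is None or t[2] == obj)]

    def semantic_search(self, query_vec, k=1):
        ranked = sorted(self.embeddings.items(),
                        key=lambda kv: cosine(query_vec, kv[1]),
                        reverse=True)
        return [name for name, _ in ranked[:k]]

    def retrieve(self, query_vec):
        # hybrid retrieval: nearest entity by embedding similarity,
        # then its outgoing graph edges as retrieved context
        entity = self.semantic_search(query_vec, k=1)[0]
        return self.graph_query(subj=entity)

mem = MemoryStore()
mem.add("Alice", "took", "medication", [1.0, 0.0])
mem.add("keys", "located_in", "kitchen drawer", [0.0, 1.0])
print(mem.retrieve([0.1, 0.9]))  # → [('keys', 'located_in', 'kitchen drawer')]
```

In a Retrieval Augmented Generation setup, the triples returned by `retrieve` would be serialized into the prompt, so the answer ("your keys are in the kitchen drawer") is traceable to specific, stored graph edges rather than to opaque model weights.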