AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory

πŸ“… 2025-10-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing RAG agents predominantly rely on text-only retrieval, neglecting multimodal signals and principles of human memory. This work introduces a cognitive science–inspired multimodal agent system: it abstracts multimodal inputs (e.g., images, text) into semantic tags and organizes them into a graph-structured contextual memory bank, enabling concept-driven, efficient retrieval and task execution. By transcending the semantic granularity and structural rigidity of conventional vector databases, the system supports dynamic relational reasoning and interpretable retrieval. The approach integrates large language models, multimodal encoders, RAG, and graph-based memory mechanisms. Empirical evaluation shows a 3.5× speedup over a conventional multimodal RAG baseline on ImageNet classification and superior performance over MemGPT on the MSC benchmark. These results demonstrate advances in efficiency, interpretability, and cognitive plausibility.

πŸ“ Abstract
Riding on the success of LLMs with retrieval-augmented generation (RAG), there has been growing interest in augmenting agent systems with external memory databases. However, existing systems focus on storing text information in their memory, ignoring the importance of multimodal signals. Motivated by the multimodal nature of human memory, we present AUGUSTUS, a multimodal agent system aligned with the ideas of human memory in cognitive science. Technically, our system consists of 4 stages connected in a loop: (i) encode: understanding the inputs; (ii) store in memory: saving important information; (iii) retrieve: searching for relevant context from memory; and (iv) act: performing the task. Unlike existing systems that use vector databases, we propose conceptualizing information into semantic tags and associating the tags with their context, storing them in a graph-structured multimodal contextual memory for efficient concept-driven retrieval. Our system outperforms the traditional multimodal RAG approach while being 3.5 times faster for ImageNet classification, and it outperforms MemGPT on the MSC benchmark.
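The four-stage loop from the abstract can be sketched as a minimal, self-contained toy. Everything here (the `ToyAgent` class, its method names, and the whitespace-token "encoder") is illustrative only, not the paper's actual implementation, which uses multimodal encoders and an LLM.

```python
# Hypothetical sketch of the encode -> store -> retrieve -> act loop.
# A real system would encode images/text with multimodal encoders and
# act via an LLM; here each stage is a trivial stand-in.

class ToyAgent:
    def __init__(self):
        self.memory = {}  # tag -> list of contexts seen with that tag

    def encode(self, raw_input):
        # (i) encode: abstract the input into semantic tags.
        # Stand-in: lowercase whitespace tokens.
        return set(raw_input.lower().split())

    def store(self, tags, raw_input):
        # (ii) store in memory: associate each tag with its context.
        for tag in tags:
            self.memory.setdefault(tag, []).append(raw_input)

    def retrieve(self, tags):
        # (iii) retrieve: collect contexts sharing a tag with the query.
        context = []
        for tag in tags:
            context.extend(self.memory.get(tag, []))
        return context

    def act(self, raw_input, context):
        # (iv) act: placeholder for the LLM call that would consume
        # the retrieved context to perform the task.
        return {"input": raw_input, "context": context}

    def step(self, raw_input):
        tags = self.encode(raw_input)
        self.store(tags, raw_input)
        return self.act(raw_input, self.retrieve(tags))
```

A second `step` call that shares a tag (e.g. "cat") with an earlier input will retrieve that earlier context, which is the concept-driven behavior the loop is built around.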
Problem

Research questions and friction points this paper is trying to address.

Developing multimodal agent systems with contextualized memory
Addressing limitations of text-only memory in existing systems
Enhancing concept-driven retrieval using graph-structured multimodal memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-structured multimodal contextual memory system
Semantic tag-based concept-driven retrieval method
Four-stage loop: encode, store, retrieve, act
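The graph-structured memory above can be approximated with a small sketch: semantic tags are nodes, tags that co-occur in the same input share an edge, and each tag points to the contexts it appeared in. Retrieval matches query tags directly and can expand one hop to related tags. All names (`TagMemoryGraph`, `store`, `retrieve`, the `expand` flag) are assumptions for illustration; the paper's actual data structure may differ.

```python
from collections import defaultdict

class TagMemoryGraph:
    """Toy graph memory: tag nodes, co-occurrence edges, attached contexts."""

    def __init__(self):
        self.contexts = defaultdict(list)   # tag -> contexts containing it
        self.neighbors = defaultdict(set)   # tag -> co-occurring tags

    def store(self, tags, context):
        # Attach the context to every tag and link co-occurring tags.
        tags = set(tags)
        for tag in tags:
            self.contexts[tag].append(context)
            self.neighbors[tag] |= tags - {tag}

    def retrieve(self, query_tags, expand=True):
        # Concept-driven retrieval: exact tag matches, optionally
        # widened by one hop along co-occurrence edges.
        matched = set(query_tags) & set(self.contexts)
        if expand:
            for tag in list(matched):
                matched |= self.neighbors[tag]
        hits = {c for tag in matched for c in self.contexts[tag]}
        return sorted(hits)

mem = TagMemoryGraph()
mem.store({"cat", "sofa"}, "photo_001")
mem.store({"cat", "garden"}, "photo_002")
print(mem.retrieve({"sofa"}))  # one-hop via "cat" also reaches photo_002
```

The one-hop expansion is what a flat vector index cannot express directly: "sofa" reaches "garden" contexts only because both tags are linked to "cat" in the graph.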
πŸ”Ž Similar Papers
No similar papers found.