AUGUSTUS: An LLM-Driven Multimodal Agent System with Contextualized User Memory

📅 2025-10-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing RAG-based agents mostly store and retrieve text, neglecting multimodal signals and the way human memory works. This work introduces AUGUSTUS, a cognitive science-inspired multimodal agent system: it abstracts multimodal inputs (e.g., images, text) into semantic tags and links those tags to their context in a graph-structured contextual memory, enabling efficient, concept-driven retrieval and task execution. In contrast to conventional vector databases, the graph structure is meant to support relational reasoning over stored concepts and more interpretable retrieval. The approach combines large language models, multimodal encoders, RAG, and graph-based memory. In the reported evaluation, the system outperforms a traditional multimodal RAG approach while running 3.5 times faster on ImageNet classification, and it outperforms MemGPT on the MSC benchmark.

📝 Abstract

Riding on the success of LLMs with retrieval-augmented generation (RAG), there has been growing interest in augmenting agent systems with external memory databases. However, existing systems focus on storing text information in their memory, ignoring the importance of multimodal signals. Motivated by the multimodal nature of human memory, we present AUGUSTUS, a multimodal agent system aligned with the ideas of human memory in cognitive science. Technically, our system consists of four stages connected in a loop: (i) encode: understanding the inputs; (ii) store in memory: saving important information; (iii) retrieve: searching for relevant context from memory; and (iv) act: performing the task. Unlike existing systems that use vector databases, we propose conceptualizing information into semantic tags and associating the tags with their context to store them in a graph-structured multimodal contextual memory for efficient concept-driven retrieval. Our system outperforms the traditional multimodal RAG approach while being 3.5 times faster for ImageNet classification, and it outperforms MemGPT on the MSC benchmark.
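To make the idea of tag-based, graph-structured memory more concrete, here is a minimal Python sketch. It is not the AUGUSTUS implementation: the class name `TagGraphMemory`, its methods, the overlap-count scoring, and the sample entries are all assumptions made for illustration. It only shows how semantic tags can act as nodes that link to the contexts they were extracted from, so that retrieval follows shared concepts rather than a vector similarity search.

```python
from collections import defaultdict


class TagGraphMemory:
    """Toy contextual memory: tag nodes linked to the context items they came from.

    Hypothetical illustration only; not the memory design used in the paper.
    """

    def __init__(self):
        self.contexts = {}                        # context id -> stored payload
        self.tag_to_contexts = defaultdict(set)   # tag node -> ids of linked contexts

    def store(self, context_id, payload, tags):
        """Save a context item and connect it to each of its semantic tags."""
        self.contexts[context_id] = payload
        for tag in tags:
            self.tag_to_contexts[tag.lower()].add(context_id)

    def retrieve(self, query_tags, top_k=3):
        """Return contexts ranked by how many query tags they share."""
        scores = defaultdict(int)
        for tag in query_tags:
            for cid in self.tag_to_contexts.get(tag.lower(), ()):
                scores[cid] += 1
        # Highest overlap first; ties broken by context id for determinism.
        ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
        return [(cid, self.contexts[cid]) for cid in ranked[:top_k]]


memory = TagGraphMemory()
memory.store("img_001", {"modality": "image", "note": "dog playing on a beach"}, ["dog", "beach"])
memory.store("msg_014", {"modality": "text", "note": "user says their dog is named Max"}, ["dog", "Max"])
memory.store("msg_020", {"modality": "text", "note": "user plans a mountain trip"}, ["trip", "mountains"])

hits = memory.retrieve(["dog", "beach"])
for context_id, payload in hits:
    print(context_id, payload["modality"], payload["note"])
```

In this sketch an image and a text message end up connected through the shared `dog` tag, which is the kind of cross-modal association a concept-driven lookup relies on. A real system would obtain the tags from a multimodal model rather than by hand.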

Problem

Research questions and friction points this paper is trying to address.

Developing multimodal agent systems with contextualized memory

Addressing limitations of text-only memory in existing systems

Enhancing concept-driven retrieval using graph-structured multimodal memory

Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-structured multimodal contextual memory system

Semantic tag-based concept-driven retrieval method

Four-stage loop: encode, store, retrieve, act
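The four-stage loop can be sketched as a simple control flow. The Python below is a hedged illustration under strong simplifying assumptions: the function names mirror the stage names from the abstract, but their bodies are placeholders (word splitting instead of a multimodal encoder, a list instead of a graph memory, string formatting instead of an LLM call) and none of them reflect the authors' code.

```python
def encode(observation):
    """(i) Encode: stand-in for input understanding; tags are lowercase words."""
    return {word.strip(".,!?").lower() for word in observation.split()}


def store(memory, observation, tags):
    """(ii) Store: save the observation with its tags and return its index."""
    memory.append({"content": observation, "tags": tags})
    return len(memory) - 1


def retrieve(memory, tags, skip_index):
    """(iii) Retrieve: earlier entries sharing at least one tag with the query."""
    return [
        entry["content"]
        for index, entry in enumerate(memory)
        if index != skip_index and tags & entry["tags"]
    ]


def act(observation, context):
    """(iv) Act: placeholder for the task step (e.g., an LLM call)."""
    return f"{observation} | context: {context}"


def agent_loop(observations):
    """Run encode -> store -> retrieve -> act once per incoming observation."""
    memory, outputs = [], []
    for observation in observations:
        tags = encode(observation)
        index = store(memory, observation, tags)
        context = retrieve(memory, tags, index)
        outputs.append(act(observation, context))
    return outputs


outputs = agent_loop(["My dog Max", "Walk the dog"])
for line in outputs:
    print(line)
```

Because the stages run in a loop, whatever is stored on one turn becomes retrievable context on later turns; here the second observation picks up the first one through the shared word `dog`.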

🔎 Similar Papers

Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots