AI Summary
Existing RAG agents predominantly rely on text-only retrieval, neglecting multimodal signals and principles of human memory. This work introduces AUGUSTUS, a cognitive science-inspired multimodal agent system: it abstracts multimodal inputs (e.g., images, text) into semantic tags and organizes them into a graph-structured contextual memory bank, enabling efficient, concept-driven retrieval and task execution. By moving beyond the fixed granularity and structural rigidity of conventional vector databases, the system aims to support relational reasoning and interpretable retrieval. The approach integrates large language models, multimodal encoders, RAG, and graph-based memory mechanisms. In the reported evaluation, the system outperforms a traditional multimodal RAG approach on ImageNet classification while being 3.5× faster, and it outperforms MemGPT on the MSC benchmark. These results suggest gains in efficiency from a design motivated by interpretability and cognitive plausibility.
Abstract
Riding on the success of LLMs with retrieval-augmented generation (RAG), there has been growing interest in augmenting agent systems with external memory databases. However, existing systems focus on storing text information in their memory, ignoring the importance of multimodal signals. Motivated by the multimodal nature of human memory, we present AUGUSTUS, a multimodal agent system aligned with the ideas of human memory in cognitive science. Technically, our system consists of four stages connected in a loop: (i) encode: understanding the inputs; (ii) store in memory: saving important information; (iii) retrieve: searching for relevant context from memory; and (iv) act: performing the task. Unlike existing systems that use vector databases, we propose conceptualizing information into semantic tags and associating the tags with their context to store them in a graph-structured multimodal contextual memory for efficient concept-driven retrieval. Our system outperforms the traditional multimodal RAG approach on ImageNet classification while being 3.5 times faster, and it outperforms MemGPT on the MSC benchmark.
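The four-stage loop and the tag-to-context memory described above can be illustrated with a minimal Python sketch. This is not the AUGUSTUS implementation: every name here (ContextualMemory, encode, act, step) is hypothetical, the encoder is a toy keyword extractor standing in for a multimodal model, and ranking by the number of shared tags is an assumed retrieval rule chosen only to keep the example small.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ContextualMemory:
    """Toy bipartite graph linking semantic tags to stored contexts (hypothetical)."""
    contexts: list = field(default_factory=list)  # context id -> payload
    tag_to_contexts: dict = field(default_factory=lambda: defaultdict(set))

    def store(self, tags, payload):
        """Add a context node and connect it to each of its tag nodes."""
        context_id = len(self.contexts)
        self.contexts.append(payload)
        for tag in tags:
            self.tag_to_contexts[tag].add(context_id)
        return context_id

    def retrieve(self, query_tags, top_k=3, exclude=None):
        """Follow tag edges and rank contexts by the number of matching tags."""
        hits = defaultdict(int)
        for tag in query_tags:
            for context_id in self.tag_to_contexts.get(tag, ()):
                if context_id != exclude:
                    hits[context_id] += 1
        # Most shared tags first; ties broken by insertion order.
        ranked = sorted(hits, key=lambda cid: (-hits[cid], cid))
        return [self.contexts[cid] for cid in ranked[:top_k]]


def encode(text):
    """Stand-in encoder (assumption): words longer than 3 characters become tags."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def act(user_input, retrieved):
    """Stand-in for the act stage: build the prompt an LLM would receive."""
    context = "\n".join(retrieved) if retrieved else "(no relevant memory)"
    return f"Context:\n{context}\n\nInput: {user_input}"


def step(memory, user_input):
    """One pass of the loop: encode, store, retrieve, act."""
    tags = encode(user_input)
    new_id = memory.store(tags, user_input)
    # Skip the context that was just stored so the input does not match itself.
    retrieved = memory.retrieve(tags, exclude=new_id)
    return act(user_input, retrieved), retrieved
```

Calling step repeatedly on the same ContextualMemory mimics the loop: each input is tagged and stored, and later inputs that share tags with it pull it back as context. Because lookup goes through tag edges rather than a nearest-neighbour search over embeddings, the tags that caused a match are visible, which is the intuition behind concept-driven retrieval in this sketch.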