🤖 AI Summary
To address the limited intuitiveness and interactivity of existing scientific literature retrieval systems, this paper proposes a multimodal agent-enhanced Retrieval-Augmented Generation (RAG) framework. The method employs Large Vision-Language Models (LVLMs) as collaborative, tool-augmented multimodal agents that support natural-language-driven cross-modal exploration. It combines multi-agent coordination, multimodal retrieval, and fused text-image understanding to enable curiosity-guided, interdisciplinary discovery. A proof-of-concept application over 32 collections comprising more than 64,000 unique records from a public university's scientific collection illustrates the system's utility for both science communication (e.g., fostering independent exploration by pupils and students) and research assistance (e.g., surfacing interdisciplinary connections and complementary visual data).
📄 Abstract
In this paper, we introduce CollEx, an innovative multimodal agentic Retrieval-Augmented Generation (RAG) system designed to enhance interactive exploration of extensive scientific collections. Given the overwhelming volume and inherent complexity of scientific collections, conventional search systems often lack the necessary intuitiveness and interactivity, presenting substantial barriers for learners, educators, and researchers. CollEx addresses these limitations by employing state-of-the-art Large Vision-Language Models (LVLMs) as multimodal agents accessible through an intuitive chat interface. By abstracting complex interactions via specialized agents equipped with advanced tools, CollEx facilitates curiosity-driven exploration, significantly simplifying access to diverse scientific collections and the records therein. Our system integrates textual and visual modalities, supporting educational scenarios for teachers, pupils, students, and researchers by fostering independent exploration as well as scientific excitement and curiosity. Furthermore, CollEx serves the research community by revealing interdisciplinary connections and complementing textual records with visual data. We illustrate the effectiveness of our system through a proof-of-concept application containing over 64,000 unique records across 32 collections from the local scientific collection of a public university.
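The abstract describes LVLMs acting as tool-augmented agents that route a user's natural-language query to modality-specific retrieval tools. The paper does not specify an API, so the following is a minimal, self-contained sketch of that routing pattern under stated assumptions: the `Record` type, the toy in-memory collection, and the keyword-based tool selection are all hypothetical stand-ins (a real system would use an LVLM's function-calling interface and a vector index rather than token overlap).

```python
from dataclasses import dataclass

# Hypothetical record type standing in for entries in a scientific collection.
@dataclass
class Record:
    title: str
    modality: str  # "text" or "image"

# Toy in-memory "collection"; a real system would query an indexed store.
COLLECTION = [
    Record("Herbarium sheet of Quercus robur", "image"),
    Record("Field notes on oak distribution", "text"),
    Record("Mineralogy specimen photograph", "image"),
]

def _matches(query: str, record: Record) -> bool:
    """Crude relevance check: any shared token between query and title."""
    return bool(set(query.lower().split()) & set(record.title.lower().split()))

def search_text(query: str) -> list[Record]:
    """Tool: retrieve textual records relevant to the query."""
    return [r for r in COLLECTION if r.modality == "text" and _matches(query, r)]

def search_images(query: str) -> list[Record]:
    """Tool: retrieve image records relevant to the query."""
    return [r for r in COLLECTION if r.modality == "image" and _matches(query, r)]

def agent_route(query: str) -> list[Record]:
    """Minimal 'agent': choose a retrieval tool from the query, then call it.
    An LVLM-based agent would instead select tools via function calling."""
    if any(kw in query.lower() for kw in ("image", "photo", "sheet", "picture")):
        return search_images(query)
    return search_text(query)
```

The point of the sketch is the separation of concerns the abstract implies: specialized tools encapsulate each retrieval modality, while the agent layer hides that complexity behind a single natural-language entry point.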