🤖 AI Summary
Oracle bone script (OBS) research faces two major bottlenecks: complex, multi-stage workflows that mix sequential and parallel subtasks, and inefficient resource retrieval. To address these challenges, this paper introduces OracleAgent, the first intelligent agent system designed specifically for oracle bone script analysis. The system integrates large language models, multimodal reasoning, image retrieval, and knowledge graph technologies, and builds on a domain-specific multimodal knowledge base of over 1.4 million single-character rubbing images and 80K interpretation texts. Its modular agent architecture dynamically orchestrates specialized tools (including character recognition, glyph matching, and textual interpretation) to enable cross-modal semantic alignment and end-to-end task coordination. Experimental results show that the system outperforms leading multimodal foundation models (e.g., GPT-4o) on multiple oracle bone script reasoning benchmarks. A case study further indicates that it substantially reduces the time experts spend on OBS research, supporting its practical utility in real-world scholarly workflows.
📝 Abstract
As one of the earliest writing systems, Oracle Bone Script (OBS) preserves the cultural and intellectual heritage of ancient civilizations. However, current OBS research faces two major challenges: (1) the interpretation of OBS involves a complex workflow comprising multiple serial and parallel sub-tasks, and (2) the efficiency of OBS information organization and retrieval remains a critical bottleneck, as scholars often spend substantial effort searching for, compiling, and managing relevant resources. To address these challenges, we present OracleAgent, the first agent system designed for the structured management and retrieval of OBS-related information. OracleAgent integrates multiple OBS analysis tools, empowered by large language models (LLMs), and can flexibly orchestrate these components. Additionally, we construct a comprehensive domain-specific multimodal knowledge base for OBS, built through a rigorous multi-year process of data collection, cleaning, and expert annotation. The knowledge base comprises over 1.4M single-character rubbing images and 80K interpretation texts. OracleAgent leverages this resource through its multimodal tools to assist experts in retrieving characters, documents, interpretation texts, and rubbing images. Extensive experiments demonstrate that OracleAgent achieves superior performance across a range of multimodal reasoning and generation tasks, surpassing leading multimodal large language models (MLLMs) such as GPT-4o. Furthermore, our case study illustrates that OracleAgent can effectively assist domain experts, significantly reducing the time cost of OBS research. These results highlight OracleAgent as a significant step toward the practical deployment of assisted OBS research and automated interpretation systems.
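The workflow of serial and parallel sub-tasks mentioned in the abstract can be pictured with a minimal sketch. This is not OracleAgent's implementation: the tool names (`recognize_character`, `match_glyph`, `interpret_text`), the `Stage` type, and the `run_pipeline` function are all hypothetical placeholders, and the tools return dummy strings. It only shows one plausible orchestration pattern, in which tools inside a stage run concurrently and stages run one after another.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List

# Hypothetical tool signature: reads the shared context, returns a result string.
Tool = Callable[[Dict[str, str]], Awaitable[str]]


@dataclass
class Stage:
    """Tools in one stage run in parallel; stages themselves run serially."""
    tools: Dict[str, Tool]


# Placeholder tools (assumptions for illustration, not real OBS models).
async def recognize_character(ctx: Dict[str, str]) -> str:
    return f"candidate reading for {ctx['image']}"


async def match_glyph(ctx: Dict[str, str]) -> str:
    return f"similar rubbings for {ctx['image']}"


async def interpret_text(ctx: Dict[str, str]) -> str:
    # Depends on the output of an earlier stage, so it must run after it.
    return f"interpretation based on: {ctx['recognize_character']}"


async def run_pipeline(stages: List[Stage], ctx: Dict[str, str]) -> Dict[str, str]:
    ctx = dict(ctx)  # copy so the caller's input is left untouched
    for stage in stages:
        names = list(stage.tools)
        # Run every tool of the current stage concurrently.
        results = await asyncio.gather(*(stage.tools[n](ctx) for n in names))
        # Store each result under its tool name for later stages to use.
        ctx.update(zip(names, results))
    return ctx


stages = [
    Stage({"recognize_character": recognize_character, "match_glyph": match_glyph}),
    Stage({"interpret_text": interpret_text}),
]
result = asyncio.run(run_pipeline(stages, {"image": "rubbing_001.png"}))
```

In this sketch, recognition and glyph matching are independent and share a stage, while interpretation waits for the recognition result. In an agent system of the kind the abstract describes, an LLM would presumably choose the stages dynamically rather than use a fixed list; that planning step is left out here.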