🤖 AI Summary
Oracle bone script (OBS) decipherment is hindered by scarce textual corpora and fragmented archaeological discoveries. Method: This work pioneers a systematic investigation of large multimodal models (LMMs) for pictographic oracle bone character (OBC) visual decipherment. We introduce PictOBI-20k, the first multimodal dataset pairing 20,000 OBC glyphs with corresponding real-world object images, and design a 15,000-item multiple-choice evaluation framework. To assess alignment between human and model visual reasoning, we incorporate human-annotated saliency maps as grounding references. Contribution/Results: Experiments reveal a critical LMM limitation: heavy reliance on linguistic priors at the expense of visual features, with quantitatively low visual attention utilization efficiency. This study establishes the first benchmark dataset, evaluation platform, and attribution analysis methodology dedicated to ancient script visual decipherment, laying foundational groundwork for developing OBS-specialized multimodal models.
📝 Abstract
Deciphering oracle bone characters (OBCs), the oldest attested form of written Chinese, has long remained a central goal of scholars, offering an irreplaceable key to understanding humanity's early modes of production. Current OBC decipherment methodologies are primarily constrained by the sporadic nature of archaeological excavations and the limited corpus of inscriptions. The powerful visual perception capabilities of large multimodal models (LMMs) raise the possibility of visually deciphering OBCs with such models. In this paper, we introduce PictOBI-20k, a dataset designed to evaluate LMMs on the visual decipherment task for pictographic OBCs. It includes 20k meticulously collected OBC and real-object images, forming over 15k multiple-choice questions. We also conduct subjective annotations to investigate the consistency of visual reference points between humans and LMMs during visual reasoning. Experiments indicate that general LMMs possess preliminary visual decipherment skills but do not use visual information effectively, being limited most of the time by language priors. We hope that our dataset can facilitate the evaluation and optimization of visual attention in future OBC-oriented LMMs. The code and dataset will be available at https://github.com/OBI-Future/PictOBI-20k.
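To make the evaluation setup concrete, the multiple-choice protocol described above can be sketched as follows. This is a minimal illustration, not the actual PictOBI-20k loader: the item fields (`glyph_path`, `candidates`, `answer`) and the random-guess baseline are assumptions for demonstration only.

```python
import random
from dataclasses import dataclass

# Hypothetical item schema: one OBC glyph image paired with several
# candidate real-object images, exactly one of which matches the glyph.
@dataclass
class MCQItem:
    glyph_path: str
    candidates: list  # paths to candidate real-object images
    answer: int       # index of the correct candidate

def accuracy(items, predict):
    """Fraction of items where the model's chosen index equals the key."""
    correct = sum(1 for it in items if predict(it) == it.answer)
    return correct / len(items)

# Toy demo: 1,000 four-way items scored against a seeded random guesser,
# which should land near the 25% chance level.
rng = random.Random(0)
items = [
    MCQItem(
        glyph_path=f"obc_{i}.png",
        candidates=[f"obj_{i}_{j}.png" for j in range(4)],
        answer=rng.randrange(4),
    )
    for i in range(1000)
]
guesser = lambda it: rng.randrange(len(it.candidates))
print(f"random-baseline accuracy: {accuracy(items, guesser):.3f}")
```

In the real benchmark, `predict` would wrap an LMM call that sees the glyph and candidate images; the chance-level baseline above is the floor against which the paper's "preliminary visual decipherment skills" can be read.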