🤖 AI Summary
Oracle bone script (OBS) decipherment is hindered by scarce textual corpora and fragmented archaeological discoveries. Method: This work pioneers a systematic investigation of large multimodal models (LMMs) for pictographic oracle bone character (OBC) visual decipherment. We introduce PictOBI-20k, the first multimodal dataset pairing 20,000 OBC glyphs with corresponding real-world object images, and design a 15,000-item multiple-choice evaluation framework. To assess alignment between human and model visual reasoning, we incorporate human-annotated saliency maps as grounding references. Contribution/Results: Experiments reveal a critical LMM limitation: heavy reliance on linguistic priors at the expense of visual features, with quantitatively low visual attention utilization efficiency. This study establishes the first benchmark dataset, evaluation platform, and attribution analysis methodology dedicated to ancient script visual decipherment, laying foundational groundwork for developing OBS-specialized multimodal models.
📝 Abstract
Deciphering oracle bone characters (OBCs), the oldest attested form of written Chinese, has long remained a central goal of scholars, offering an irreplaceable key to understanding humanity's early modes of production. Current OBC decipherment methodologies are primarily constrained by the sporadic nature of archaeological excavations and the limited corpus of inscriptions. The powerful visual perception capabilities of large multimodal models (LMMs) raise the possibility of visually deciphering OBCs with such models. In this paper, we introduce PictOBI-20k, a dataset designed to evaluate LMMs on the visual decipherment task for pictographic OBCs. It includes 20k meticulously collected OBC and real-object images, forming over 15k multiple-choice questions. We also conduct subjective annotations to investigate the consistency of visual reference points between humans and LMMs during visual reasoning. Experiments indicate that general LMMs possess preliminary visual decipherment skills but do not use visual information effectively, being limited most of the time by language priors. We hope that our dataset can facilitate the evaluation and optimization of visual attention in future OBC-oriented LMMs. The code and dataset will be available at https://github.com/OBI-Future/PictOBI-20k.
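To make the evaluation setup concrete, the multiple-choice protocol described above can be sketched as follows. This is a minimal illustration, not the actual PictOBI-20k loader: the item fields (`glyph_path`, `candidates`, `answer`) and the random-guess baseline are assumptions for demonstration only.

```python
import random
from dataclasses import dataclass

# Hypothetical item schema: one OBC glyph image paired with several
# candidate real-object images, exactly one of which matches the glyph.
@dataclass
class MCQItem:
    glyph_path: str
    candidates: list  # paths to candidate real-object images
    answer: int       # index of the correct candidate

def accuracy(items, predict):
    """Fraction of items where the model's chosen index equals the key."""
    correct = sum(1 for it in items if predict(it) == it.answer)
    return correct / len(items)

# Toy demo: 1,000 four-way items scored against a seeded random guesser,
# which should land near the 25% chance level.
rng = random.Random(0)
items = [
    MCQItem(
        glyph_path=f"obc_{i}.png",
        candidates=[f"obj_{i}_{j}.png" for j in range(4)],
        answer=rng.randrange(4),
    )
    for i in range(1000)
]
guesser = lambda it: rng.randrange(len(it.candidates))
print(f"random-baseline accuracy: {accuracy(items, guesser):.3f}")
```

In the real benchmark, `predict` would wrap an LMM call that sees the glyph and candidate images; the chance-level baseline above is the floor against which the paper's "preliminary visual decipherment skills" can be read.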