🤖 AI Summary
To address the long-standing stagnation in semantic understanding of the Dongba pictographic script, caused primarily by the absence of high-quality multimodal datasets, this paper introduces DongbaMIE, the first dedicated multimodal dataset for Dongba script. It comprises 23,530 sentence-level and 2,539 paragraph-level annotated images with fine-grained semantic labels across four dimensions: objects, actions, relations, and attributes. Building on this dataset, the authors establish the first multimodal semantic-understanding benchmark for Dongba script, formalizing a four-dimensional information extraction task. Using human-annotated image–Chinese semantic pairs, they conduct zero-shot and supervised fine-tuning evaluations of state-of-the-art multimodal large language models (MLLMs) — GPT-4o, Gemini-2.0, and Qwen2-VL — with F1 score as the primary metric. Results reveal severe limitations: the best zero-shot object-extraction F1 is only 3.16, and even after fine-tuning Qwen2-VL achieves merely 11.49, underscoring fundamental challenges in deep semantic interpretation of ancient scripts and highlighting the need for new methodological paradigms.
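The paper's exact scoring script isn't reproduced here, but a minimal sketch of how set-based F1 for one semantic dimension might be computed, assuming gold and predicted labels are compared as string multisets (the function name, label strings, and per-image scoring granularity are illustrative assumptions, not the authors' protocol):

```python
from collections import Counter

def f1_per_dimension(gold: list[str], pred: list[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 between gold and predicted labels
    for one semantic dimension (e.g. objects) of one image.
    Labels are compared as multisets; exact string match assumed."""
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # Number of labels shared by prediction and gold (multiset intersection).
    overlap = sum((gold_counts & pred_counts).values())
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: objects extracted from one Dongba image.
gold = ["person", "horse", "mountain"]
pred = ["person", "tree"]
print(f1_per_dimension(gold, pred))  # (0.5, 0.333..., 0.4)
```

Per-image scores would then presumably be aggregated across the test set and reported per dimension; whether the paper uses micro- or macro-averaging is not stated in this summary.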
📝 Abstract
Dongba pictographs are the only pictographic script still in use in the world. They exhibit pictorial and ideographic features, and their symbols carry rich cultural and contextual information. Due to the lack of relevant datasets, existing research has struggled to advance semantic understanding of Dongba pictographs. To this end, we propose DongbaMIE, the first multimodal dataset for semantic understanding and extraction of Dongba pictographs. The dataset consists of Dongba pictograph images and their corresponding Chinese semantic annotations. It contains 23,530 sentence-level and 2,539 paragraph-level images, covering four semantic dimensions: objects, actions, relations, and attributes. We systematically evaluate the GPT-4o, Gemini-2.0, and Qwen2-VL models. Experimental results show that the best object-extraction F1 scores of GPT-4o and Gemini-2.0 are only 3.16 and 3.11, respectively, and that Qwen2-VL reaches only 11.49 even after supervised fine-tuning. These results suggest that current multimodal large language models still face significant challenges in accurately recognizing the diverse semantic information in Dongba pictographs. The dataset can be obtained from this URL.