DongbaMIE: A Multimodal Information Extraction Dataset for Evaluating Semantic Understanding of Dongba Pictograms

📅 2025-03-05
🤖 AI Summary
To address the long-standing stagnation in Dongba pictographic script semantic understanding—primarily due to the absence of high-quality multimodal datasets—this paper introduces DongbaMIE, the first dedicated multimodal dataset for Dongba script. It comprises 23,530 sentence-level and 2,539 paragraph-level annotated images, with fine-grained semantic labels across four dimensions: objects, actions, relations, and attributes. We establish the first multimodal semantic understanding benchmark for Dongba script, formalizing a four-dimensional information extraction task. Leveraging human-annotated image–Chinese semantic pairs, we conduct zero-shot and supervised fine-tuning evaluations on state-of-the-art multimodal large language models (MLLMs)—GPT-4o, Gemini-2.0, and Qwen2-VL—using F1 score as the primary metric. Results reveal severe limitations: best zero-shot object extraction F1 is only 3.16; even after fine-tuning, Qwen2-VL achieves merely 11.49, underscoring fundamental challenges in deep semantic interpretation of ancient scripts and highlighting the urgent need for novel methodological paradigms.

📝 Abstract
Dongba pictographs are the only pictographic script still in active use in the world. They have pictorial ideographic features, and their symbols carry rich cultural and contextual information. Due to the lack of relevant datasets, existing research has struggled to advance the semantic understanding of Dongba pictographs. To this end, we propose DongbaMIE, the first multimodal dataset for semantic understanding and extraction of Dongba pictographs. The dataset consists of Dongba pictograph images and their corresponding Chinese semantic annotations. It contains 23,530 sentence-level and 2,539 paragraph-level images, covering four semantic dimensions: objects, actions, relations, and attributes. We systematically evaluate the GPT-4o, Gemini-2.0, and Qwen2-VL models. Experimental results show that the best object-extraction F1 scores of GPT-4o and Gemini are only 3.16 and 3.11, respectively, and that Qwen2-VL reaches only 11.49 even after supervised fine-tuning. These results suggest that current multimodal large language models still face significant challenges in accurately recognizing the diverse semantic information in Dongba pictographs. The dataset can be obtained from this URL.
Problem

Research questions and friction points this paper is trying to address.

Lack of datasets for Dongba pictograph semantic understanding.
Need for multimodal dataset to evaluate semantic extraction.
Current models struggle with Dongba pictograph semantic recognition.
Innovation

Methods, ideas, or system contributions that make the work stand out.

First multimodal dataset for Dongba pictographs
Includes images and Chinese semantic annotations
Evaluates GPT-4o, Gemini-2.0, Qwen2-VL models
Xiaojun Bi
Shuo Li
College of Information and Communication Engineering, Harbin Engineering University, Harbin, China
Ziyue Wang
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Fuwen Luo
Tsinghua University
Weizheng Qiao
College of Information and Engineering, Minzu University of China, Beijing, China; Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing, China
Lu Han
College of Information and Engineering, Minzu University of China, Beijing, China; Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing, China
Ziwei Sun
College of Information and Engineering, Minzu University of China, Beijing, China; Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China, Beijing, China
Peng Li
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Yang Liu
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China