Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution

📅 2025-05-16
🤖 AI Summary
To address the lack of dedicated benchmarks for cross-modal retrieval of Chinese traditional culture, this paper introduces CulTi—the first bilingual (Chinese–English) image–text retrieval dataset tailored to cultural heritage, comprising 5,726 paired samples of silk patterns and Dunhuang murals. CulTi fills a critical gap by providing fine-grained cultural motif–text alignments at the local level. We propose LACLIP, a training-free local alignment method built upon fine-tuned Chinese CLIP, which enables weighted similarity matching between global text embeddings and local visual regions. On CulTi, LACLIP significantly outperforms existing models in both text-to-image and image-to-text retrieval, especially on fine-grained cultural semantic association tasks. This work establishes a new benchmark and paradigm for intelligent understanding of traditional Chinese culture.

📝 Abstract
China has a long and rich history, encompassing a vast cultural heritage that includes diverse multimodal information, such as silk patterns, Dunhuang murals, and their associated historical narratives. Cross-modal retrieval plays a pivotal role in understanding and interpreting Chinese cultural heritage by bridging visual and textual modalities to enable accurate text-to-image and image-to-text retrieval. However, despite the growing interest in multimodal research, there is a lack of specialized datasets dedicated to Chinese cultural heritage, limiting the development and evaluation of cross-modal learning models in this domain. To address this gap, we propose a multimodal dataset named CulTi, which contains 5,726 image-text pairs extracted from two series of professional documents, respectively related to ancient Chinese silk and Dunhuang murals. Compared to existing general-domain multimodal datasets, CulTi presents a challenge for cross-modal retrieval: the difficulty of local alignment between intricate decorative motifs and specialized textual descriptions. To address this challenge, we propose LACLIP, a training-free local alignment strategy built upon a fine-tuned Chinese-CLIP. LACLIP enhances the alignment of global textual descriptions with local visual regions by computing weighted similarity scores during inference. Experimental results on CulTi demonstrate that LACLIP significantly outperforms existing models in cross-modal retrieval, particularly in handling fine-grained semantic associations within Chinese cultural heritage.
Problem

Research questions and friction points this paper is trying to address.

Lack of specialized datasets for Chinese cultural heritage cross-modal retrieval
Difficulty in aligning intricate decorative motifs with specialized textual descriptions
Need for improved fine-grained semantic associations in cross-modal retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CulTi dataset for Chinese cultural heritage
Proposes LACLIP for local alignment in cross-modal retrieval
Enhances global-local alignment with weighted similarity scores
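The paper's details are not reproduced on this page, but the core idea LACLIP describes, scoring an image against a global text embedding by aggregating weighted similarities over local visual regions at inference time, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are random placeholders, and `weighted_local_similarity` and its softmax weighting are assumptions for the sketch.

```python
import numpy as np

def weighted_local_similarity(text_emb, region_embs, weights=None):
    """Score an image against a global text embedding by aggregating
    cosine similarities over its local region embeddings.

    text_emb:    (d,) global text embedding
    region_embs: (k, d) embeddings of k local visual regions
    weights:     (k,) optional importance weights; if omitted, a softmax
                 over the raw similarities emphasizes best-matching regions
    """
    t = text_emb / np.linalg.norm(text_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sims = r @ t  # cosine similarity of each region to the text
    if weights is None:
        w = np.exp(sims - sims.max())  # numerically stable softmax
        weights = w / w.sum()
    return float(np.dot(weights, sims))

# Toy usage with random stand-ins for CLIP-style embeddings.
rng = np.random.default_rng(0)
text = rng.normal(size=512)
regions = rng.normal(size=(5, 512))
score = weighted_local_similarity(text, regions)
```

Because the aggregation happens purely at inference time, a scheme like this needs no extra training on top of the fine-tuned encoder, which matches the "training-free" framing above.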
Junyi Yuan
Xi’an Jiaotong-Liverpool University, China
Jian Zhang
Xi’an Jiaotong-Liverpool University, China
Fangyu Wu
Xi’an Jiaotong-Liverpool University, China
Dongming Lu
Zhejiang University, China
Huanda Lu
NingboTech University
Qiufeng Wang
Xi’an Jiaotong-Liverpool University, China