A Sketch+Text Composed Image Retrieval Dataset for Thangka

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing compositional image retrieval benchmarks primarily focus on generic images and are ill-suited for fine-grained semantic reasoning and structured visual understanding required by cultural artifacts such as Thangka paintings. To address this gap, this work introduces CIRThan, a novel dataset comprising 2,287 high-quality Thangka images, each accompanied by a hand-drawn structured sketch and a three-level hierarchical textual description. CIRThan is the first benchmark to enable fine-grained multimodal retrieval with joint sketch-and-text queries. Through standardized data splits and comprehensive annotations, we provide a rigorous evaluation of both supervised and zero-shot compositional retrieval methods. Experiments reveal that current general-purpose approaches struggle to align abstract sketches with hierarchical semantics in the absence of domain-specific knowledge, underscoring the necessity and challenge of CIRThan as a new benchmark for cultural image understanding.

📝 Abstract
Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-specific visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to effectively align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-specific visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.
Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval
Thangka
sketch+text
domain-specific knowledge
fine-grained semantic reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Composed Image Retrieval
Sketch+Text
Thangka
Hierarchical Semantic Description
Cultural Heritage
Authors

Jinyu Xu, Wuhan University of Technology
Yi Sun, Wuhan University of Technology
Jiangling Zhang, Wuhan University of Technology
Qing Xie, Wuhan University of Technology
Daomin Ji, RMIT University
Zhifeng Bao, The University of Queensland
Jiachen Li, Wuhan University of Technology
Yanchun Ma, Wuhan Vocational College of Software and Engineering
Yongjian Liu, Wuhan University of Technology