🤖 AI Summary
Existing Composed Image Retrieval (CIR) benchmarks primarily focus on generic images and are ill-suited for the fine-grained semantic reasoning and structured visual understanding required by cultural artifacts such as Thangka paintings. To address this gap, this work introduces CIRThan, a novel dataset comprising 2,287 high-quality Thangka images, each accompanied by a hand-drawn structured sketch and a three-level hierarchical textual description. CIRThan is the first benchmark to enable fine-grained multimodal retrieval with joint sketch-and-text queries. Through standardized data splits and comprehensive annotations, it supports a rigorous evaluation of both supervised and zero-shot CIR methods. Experiments reveal that current general-purpose approaches struggle to align abstract sketches with hierarchical semantics in the absence of domain-specific knowledge, underscoring both the necessity and the difficulty of CIRThan as a new benchmark for cultural image understanding.
📝 Abstract
Composed Image Retrieval (CIR) enables image retrieval by combining multiple query modalities, but existing benchmarks predominantly focus on general-domain imagery and rely on reference images with short textual modifications. As a result, they provide limited support for retrieval scenarios that require fine-grained semantic reasoning, structured visual understanding, and domain-specific knowledge. In this work, we introduce CIRThan, a sketch+text Composed Image Retrieval dataset for Thangka imagery, a culturally grounded and knowledge-intensive visual domain characterized by complex structures, dense symbolic elements, and domain-dependent semantic conventions. CIRThan contains 2,287 high-quality Thangka images, each paired with a human-drawn sketch and hierarchical textual descriptions at three semantic levels, enabling composed queries that jointly express structural intent and multi-level semantic specification. We provide standardized data splits, comprehensive dataset analysis, and benchmark evaluations of representative supervised and zero-shot CIR methods. Experimental results reveal that existing CIR approaches, largely developed for general-domain imagery, struggle to align sketch-based abstractions and hierarchical textual semantics with fine-grained Thangka images, particularly without in-domain supervision. We believe CIRThan offers a valuable benchmark for advancing sketch+text CIR, hierarchical semantic modeling, and multimodal retrieval in cultural heritage and other knowledge-intensive visual domains. The dataset is publicly available at https://github.com/jinyuxu-whut/CIRThan.