LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval

📅 2025-11-09
🏛️ Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
In cultural heritage cross-modal retrieval, textual descriptions are often incomplete or inconsistent because historical texts are scarce and expert annotation is costly. To address this, the authors propose the C³ framework, which has two parts: (1) a completeness evaluation module that quantifies how well the text covers the semantic content of the image; and (2) an adaptive query control mechanism, formulated as a Markov Decision Process and optimized via reinforcement learning, that guides large language models (LLMs) through vision-aligned Chain-of-Thought reasoning to mitigate hallucination and improve factual consistency. C³ generates vision-grounded textual enhancements without requiring additional human annotations. Evaluated on CulTi, TimeTravel, MSCOCO, and Flickr30K, C³ achieves state-of-the-art performance in both fine-tuned and zero-shot settings, improving cross-modal retrieval accuracy.

📝 Abstract
Cross-modal retrieval is essential for interpreting cultural heritage data, but its effectiveness is often limited by incomplete or inconsistent textual descriptions, caused by historical data loss and the high cost of expert annotation. While large language models (LLMs) offer a promising solution by enriching textual descriptions, their outputs frequently suffer from hallucinations or miss visually grounded details. To address these challenges, we propose $C^3$, a data augmentation framework that enhances cross-modal retrieval performance by improving the completeness and consistency of LLM-generated descriptions. $C^3$ introduces a completeness evaluation module to assess semantic coverage using both visual cues and language-model outputs. Furthermore, to mitigate factual inconsistencies, we formulate a Markov Decision Process to supervise Chain-of-Thought reasoning, guiding consistency evaluation through adaptive query control. Experiments on the cultural heritage datasets CulTi and TimeTravel, as well as on general benchmarks MSCOCO and Flickr30K, demonstrate that $C^3$ achieves state-of-the-art performance in both fine-tuned and zero-shot settings.
Problem

Research questions and friction points this paper is trying to address.

Incomplete textual descriptions limit cultural heritage cross-modal retrieval effectiveness
LLM-generated descriptions often contain hallucinations and miss visual details
Evaluating completeness and consistency of augmented cultural heritage data is challenging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Completeness evaluation using visual and language cues
Markov Decision Process supervises Chain-of-Thought reasoning
Adaptive query control for consistency evaluation
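
The two ideas listed above can be illustrated with a toy sketch. This is not the paper's implementation: the keyword-overlap score, the stop-or-query loop, and the names `completeness`, `run_episode`, and `answer_fn` are all assumptions made for illustration. In particular, the loop uses a fixed threshold rule where the paper describes a learned Markov Decision Process policy.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens of a description."""
    return set(re.findall(r"\w+", text.lower()))


def completeness(visual_concepts: set[str], text: str) -> float:
    """Toy completeness score: fraction of visual concepts named in the text."""
    if not visual_concepts:
        return 1.0
    words = _tokens(text)
    covered = [c for c in visual_concepts if c.lower() in words]
    return len(covered) / len(visual_concepts)


def run_episode(visual_concepts, text, answer_fn, threshold=0.8, max_steps=5):
    """Toy query-control loop with two actions per step: STOP or QUERY.

    `answer_fn` is a hypothetical stand-in for an LLM call that returns a
    short phrase describing the queried concept.
    """
    steps = 0
    while steps < max_steps:
        # Action STOP: the description already covers enough of the image.
        if completeness(visual_concepts, text) >= threshold:
            break
        words = _tokens(text)
        missing = sorted(c for c in visual_concepts if c.lower() not in words)
        if not missing:
            break
        # Action QUERY: ask about one concept the text does not mention yet.
        text = text + " " + answer_fn(missing[0])
        steps += 1
    return text, completeness(visual_concepts, text), steps


if __name__ == "__main__":
    concepts = {"mural", "buddha", "lotus", "halo"}
    caption = "A mural of a seated buddha"
    print(run_episode(concepts, caption, lambda c: f"with a {c}"))
```

Restricting each query to a concept detected in the image is what keeps the added text visually grounded in this sketch. A learned policy would replace the fixed threshold with a decision that weighs the expected gain of one more query against its cost.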
Jian Zhang
Xi’an Jiaotong-Liverpool University
Junyi Guo
Cornell University
Junyi Yuan
Xi’an Jiaotong-Liverpool University
Huanda Lu
NingboTech University
AI
Yanlin Zhou
Dunhuang Academy
Fangyu Wu
Xi’an Jiaotong-Liverpool University
Qiufeng Wang
Xi’an Jiaotong-Liverpool University
Dongming Lu
Zhejiang University