🤖 AI Summary
Current vision-language models (VLMs) exhibit significant limitations in understanding cultural nuances, and retrieval-augmented generation (RAG) has not been systematically explored for multimodal cultural understanding. To address this, we introduce RAVENEA, the first retrieval-augmented multimodal benchmark for visual cultural understanding, comprising two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). It integrates over 10,000 human-annotated and ranked Wikipedia documents. Our work pioneers the systematic integration of RAG into multimodal cultural understanding, establishing a culture-aware evaluation framework and a high-quality cross-modal retrieval dataset. We propose culturally aligned retrieval ranking, human-in-the-loop curation, and lightweight VLM fine-tuning strategies. Experiments demonstrate that retrieval augmentation improves performance by at least 3.2% absolute on cVQA and 6.2% absolute on cIC for lightweight VLMs, validating its effectiveness.
📝 Abstract
As vision-language models (VLMs) become increasingly integrated into daily life, the need for accurate visual culture understanding is becoming critical. Yet, these models frequently fall short in interpreting cultural nuances effectively. Prior work has demonstrated the effectiveness of retrieval-augmented generation (RAG) in enhancing cultural understanding in text-only settings, while its application in multimodal scenarios remains underexplored. To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). RAVENEA extends existing datasets by integrating over 10,000 Wikipedia documents curated and ranked by human annotators. With RAVENEA, we train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art VLMs. Our results show that lightweight VLMs, when augmented with culture-aware retrieval, outperform their non-augmented counterparts (by at least 3.2% absolute on cVQA and 6.2% absolute on cIC). This highlights the value of retrieval-augmented methods and culturally inclusive benchmarks for multimodal understanding.
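To make the retrieve-then-generate setup concrete, below is a minimal sketch (not the authors' code) of culture-aware retrieval augmentation for cVQA: a CLIP-style retriever scores candidate Wikipedia documents against the image query, and the top-k documents are prepended to the VLM prompt. The model name, the `retrieve_documents` and `build_cvqa_prompt` helpers, and the prompt format are illustrative assumptions, not part of RAVENEA itself.

```python
# Hypothetical sketch of retrieval-augmented cVQA, assuming a CLIP-based retriever.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

retriever = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def retrieve_documents(image: Image.Image, documents: list[str], k: int = 3) -> list[str]:
    """Rank candidate documents by image-text similarity and return the top-k."""
    inputs = processor(text=documents, images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        outputs = retriever(**inputs)
    scores = outputs.logits_per_image[0]        # similarity of the image to each document
    top_idx = scores.topk(min(k, len(documents))).indices.tolist()
    return [documents[i] for i in top_idx]


def build_cvqa_prompt(question: str, retrieved: list[str]) -> str:
    """Prepend retrieved cultural context to the question (illustrative format)."""
    context = "\n\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Note that full Wikipedia articles exceed CLIP's 77-token text limit, so a practical pipeline would embed titles or summaries, or use a longer-context multimodal retriever; the sketch only illustrates the retrieve-then-prompt flow whose downstream effect the benchmark measures.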