M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of evaluation frameworks, and the resulting performance bottlenecks, in multilingual multimodal retrieval-augmented generation (RAG) for cross-cultural visual question answering (VQA). The authors introduce a large-scale multilingual, multicultural, and multimodal RAG benchmark encompassing 42 languages, 56 regional dialects and registers, and over 80,000 culturally diverse image-question pairs. The benchmark is paired with a controlled retrieval corpus that enables reproducible multilingual document-retrieval experiments. Their systematic evaluation uncovers a critical mismatch between large vision-language models (VLMs) and current retrieval mechanisms: RAG consistently improves smaller VLMs but often degrades larger ones, revealing a scalability challenge in retrieval-generation co-adaptation. These findings establish a new evaluation paradigm and deliver technical insights for cross-lingual, cross-modal, and cross-cultural reasoning.

📝 Abstract
Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.
Problem

Research questions and friction points this paper is trying to address.

Lack of benchmarks for multilingual, multimodal retrieval-augmented visual question answering across diverse cultures
Addresses the limitation of static training data in vision-language models via retrieval augmentation
Investigates the mismatch between model size and retrieval effectiveness in RAG systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual multimodal RAG benchmark spanning 42 languages, 56 regional dialects and registers, and 80,000+ image-question pairs
Controlled retrieval environment with millions of curated documents
Evaluates RAG performance across model sizes and modalities
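The evaluation described above amounts to comparing per-language accuracy with and without retrieval, then inspecting the sign of the gap per model size. A minimal sketch, assuming a simple exact-match metric and a hypothetical record format (M4-RAG's official metric and data schema may differ):

```python
# Hedged sketch of a per-language RAG-vs-baseline comparison.
# Record format and scoring are illustrative assumptions, not M4-RAG's.
from collections import defaultdict

def exact_match(pred: str, gold: str) -> bool:
    # Case- and whitespace-insensitive exact match.
    return pred.strip().lower() == gold.strip().lower()

def accuracy_by_language(records):
    # records: iterable of (language, prediction, gold) triples.
    hits, totals = defaultdict(int), defaultdict(int)
    for lang, pred, gold in records:
        totals[lang] += 1
        hits[lang] += exact_match(pred, gold)
    return {lang: hits[lang] / totals[lang] for lang in totals}

def rag_delta(acc_rag, acc_base):
    # Positive delta: retrieval helped; negative: retrieval interfered,
    # the pattern the paper reports for larger VLMs.
    return {lang: acc_rag[lang] - acc_base[lang] for lang in acc_base}
```

Running `rag_delta` once per model size would surface the paper's headline pattern: positive deltas for smaller VLMs, negative ones for larger VLMs.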