🤖 AI Summary
Existing vision-language model (VLM) evaluation benchmarks are largely Western-centric and lack systematic assessment of cultural diversity and multilingual capability. Method: We introduce IndicVisionBench, the first large-scale multilingual vision-language benchmark focused on the Indian subcontinent, comprising ~5K images and 37K+ question-answer pairs in English and 10 Indian languages across three tasks: Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA). It covers six question types and 13 culturally grounded topics, and we release a paired parallel annotation corpus across the 10 Indic languages for analyzing cultural and linguistic biases. Contribution/Results: Evaluating 8 representative VLMs, from proprietary closed-source systems to open-weight medium- and large-scale models, we uncover substantial performance gaps in Indian-language understanding and culturally grounded reasoning. The benchmark establishes a reproducible, culturally informed evaluation paradigm and provides an empirical foundation for building more inclusive multimodal AI systems.
📝 Abstract
Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks: Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), with 6 question types. The final benchmark comprises ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across the 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weight medium- and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.
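The abstract does not publish a data schema or evaluation protocol, so the following is a minimal Python sketch of what one IndicVisionBench-style record and a naive scoring loop could look like. Every field name, the language-code list, and the `model.generate` call are illustrative assumptions, not the paper's actual format or metric.

```python
from dataclasses import dataclass

# Hypothetical schema for one benchmark example. The three tasks, the
# language count (English + 10 Indic), the 6 question types, and the
# 13 topics come from the abstract; everything else is assumed.
TASKS = ("OCR", "MMT", "VQA")
LANGUAGES = ("en", "hi", "bn", "ta", "te", "mr",
             "gu", "kn", "ml", "pa", "or")  # assumed ISO 639-1 codes

@dataclass
class BenchmarkItem:
    image_path: str        # one of the ~5K images
    task: str              # "OCR" | "MMT" | "VQA"
    language: str          # language of the question/answer pair
    topic: str             # one of the 13 culturally grounded topics
    question_type: str     # one of the 6 question types
    question: str
    reference_answer: str

def exact_match_accuracy(model, items: list[BenchmarkItem]) -> float:
    """Score a VLM by normalized exact match -- a simple illustrative
    metric, not necessarily the one used in the paper."""
    correct = 0
    for item in items:
        # `model.generate(image, prompt)` is an assumed interface.
        prediction = model.generate(item.image_path, item.question)
        correct += prediction.strip().lower() == item.reference_answer.strip().lower()
    return correct / len(items)
```

A per-language breakdown (grouping `items` by `language` before scoring) would be the natural way to surface the cross-lingual performance gaps the abstract reports.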