🤖 AI Summary
Existing vision-language model (VLM) evaluation benchmarks are largely Western-centric and lack systematic assessment of cultural diversity and multilingual capability. Method: We introduce IndicVisionBench, the first large-scale multilingual vision-language benchmark focused on the Indian subcontinent, comprising ~5K images and 37K+ question-answer pairs in English and 10 Indian languages across three tasks: Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA). It covers six question types and 13 culturally grounded topics, and we release a paired parallel annotation corpus across the 10 Indic languages for analyzing cultural and linguistic biases. Contribution/Results: Evaluating 8 representative VLMs, from proprietary closed-source systems to open-weight medium- and large-scale models, we uncover substantial performance gaps in Indian-language understanding and culturally grounded reasoning. The benchmark establishes a reproducible, culturally informed evaluation paradigm and provides an empirical foundation for building more inclusive multimodal AI systems.
📝 Abstract
Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks: Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), with 6 question types. The final benchmark comprises ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across the 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weight medium- and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.
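The abstract does not publish a data schema or evaluation protocol, so the following is a minimal Python sketch of what one IndicVisionBench-style record and a naive scoring loop could look like. Every field name, the language-code list, and the `model.generate` call are illustrative assumptions, not the paper's actual format or metric.

```python
from dataclasses import dataclass

# Hypothetical schema for one benchmark example. The three tasks, the
# language count (English + 10 Indic), the 6 question types, and the
# 13 topics come from the abstract; everything else is assumed.
TASKS = ("OCR", "MMT", "VQA")
LANGUAGES = ("en", "hi", "bn", "ta", "te", "mr",
             "gu", "kn", "ml", "pa", "or")  # assumed ISO 639-1 codes

@dataclass
class BenchmarkItem:
    image_path: str        # one of the ~5K images
    task: str              # "OCR" | "MMT" | "VQA"
    language: str          # language of the question/answer pair
    topic: str             # one of the 13 culturally grounded topics
    question_type: str     # one of the 6 question types
    question: str
    reference_answer: str

def exact_match_accuracy(model, items: list[BenchmarkItem]) -> float:
    """Score a VLM by normalized exact match -- a simple illustrative
    metric, not necessarily the one used in the paper."""
    correct = 0
    for item in items:
        # `model.generate(image, prompt)` is an assumed interface.
        prediction = model.generate(item.image_path, item.question)
        correct += prediction.strip().lower() == item.reference_answer.strip().lower()
    return correct / len(items)
```

A per-language breakdown (grouping `items` by `language` before scoring) would be the natural way to surface the cross-lingual performance gaps the abstract reports.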