IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs

📅 2025-11-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language model (VLM) evaluation benchmarks exhibit Western-centric biases and lack systematic assessment of cultural diversity and multilingual capabilities. Method: We introduce IndicVisionBench, the first large-scale multilingual vision-language benchmark focused on the Indian subcontinent, comprising roughly 5,000 images and over 37,000 question-answer pairs in English and 10 Indian languages across three core tasks: Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA). It covers six question types and thirteen culturally grounded topics, and we release a high-quality parallel annotation corpus across the 10 Indic languages. Contribution/Results: Evaluating eight representative VLM families on IndicVisionBench, we uncover substantial, previously unreported gaps in Indian-language comprehension and culturally grounded reasoning. The benchmark establishes a reproducible, culturally informed evaluation paradigm and provides an empirical foundation for developing more inclusive multimodal AI systems.

📝 Abstract
Vision-language models (VLMs) have demonstrated impressive generalization across multimodal tasks, yet most evaluation benchmarks remain Western-centric, leaving open questions about their performance in culturally diverse and multilingual settings. To address this gap, we introduce IndicVisionBench, the first large-scale benchmark centered on the Indian subcontinent. Covering English and 10 Indian languages, our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA), covering 6 kinds of question types. Our final benchmark consists of a total of ~5K images and 37K+ QA pairs across 13 culturally grounded topics. In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs. We evaluate a broad spectrum of 8 models, from proprietary closed-source systems to open-weights medium and large-scale models. Our experiments reveal substantial performance gaps, underscoring the limitations of current VLMs in culturally diverse contexts. By centering cultural diversity and multilinguality, IndicVisionBench establishes a reproducible evaluation framework that paves the way for more inclusive multimodal research.
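As a concrete, purely hypothetical illustration of the benchmark's structure, the sketch below shows one way a single IndicVisionBench example and a per-language scoring loop might be represented. The field names, language codes, and the `model.generate` call are assumptions for illustration, not the released schema or evaluation code.

```python
# Hypothetical sketch of an IndicVisionBench-style record and scoring loop.
# Field names and language codes are assumptions, not the released schema.
from dataclasses import dataclass
from typing import List, Dict

LANGUAGES = ["en", "hi", "bn", "ta", "te", "kn", "ml", "mr", "gu", "pa", "or"]  # English + 10 Indic (assumed codes)
TASKS = ["ocr", "mmt", "vqa"]

@dataclass
class BenchmarkExample:
    image_path: str          # one of the ~5K culturally grounded images
    task: str                # "ocr", "mmt", or "vqa"
    language: str            # language of the question/answer
    question_type: str       # one of the 6 question types (names not specified here)
    topic: str               # one of the 13 culturally grounded topics
    question: str
    reference_answer: str

def evaluate(model, examples: List[BenchmarkExample]) -> Dict[str, float]:
    """Toy per-language exact-match accuracy; the real metrics differ per task."""
    correct, total = {}, {}
    for ex in examples:
        pred = model.generate(ex.image_path, ex.question)  # assumed model API
        total[ex.language] = total.get(ex.language, 0) + 1
        if pred.strip().lower() == ex.reference_answer.strip().lower():
            correct[ex.language] = correct.get(ex.language, 0) + 1
    return {lang: correct.get(lang, 0) / n for lang, n in total.items()}
```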
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' performance in culturally diverse and multilingual contexts
Assessing multimodal capabilities across OCR, MMT, and VQA tasks
Analyzing cultural and linguistic biases in vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale multilingual VLM benchmark for the Indian subcontinent
Covers OCR, MMT, and VQA in English and 10 Indic languages
Provides a parallel annotation corpus for cultural and linguistic bias analysis (sketched below)
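Because annotations are parallel across English and the 10 Indic languages, one simple bias analysis is to compare per-language accuracy on identical items. The sketch below is a minimal illustration of that idea; the `(item_id, language, is_correct)` result format is an assumption, not the paper's evaluation protocol.

```python
# Hypothetical cross-lingual gap analysis enabled by parallel annotations:
# the same item exists in English and each Indic language, so accuracy can
# be compared on identical content.
from collections import defaultdict

def language_gaps(results):
    """results: iterable of (item_id, language, is_correct) over parallel items.
    Returns each language's accuracy drop relative to English."""
    per_lang = defaultdict(lambda: [0, 0])  # language -> [correct, total]
    for _, lang, ok in results:
        per_lang[lang][1] += 1
        per_lang[lang][0] += int(ok)
    acc = {lang: c / t for lang, (c, t) in per_lang.items()}
    en = acc.get("en", 0.0)
    return {lang: en - a for lang, a in acc.items() if lang != "en"}

# Toy data: two parallel items scored in English and Hindi.
toy = [(1, "en", True), (1, "hi", False), (2, "en", True), (2, "hi", True)]
print(language_gaps(toy))  # {'hi': 0.5}
```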
Ali Faraz
Data Scientist, Krutrim
Machine Learning, LLMs, LVMs, Computer Vision
Akash
OLA Electric, Bangalore, India
Shaharukh Khan
Unknown affiliation
Machine Learning, VLM
Raja Kolla
Krutrim AI, Bangalore, India
Akshat Patidar
Krutrim AI, Bangalore, India
Suranjan Goswami
OLA Electric, Bangalore, India
Abhinav Ravi
Krutrim AI, Bangalore, India
Chandra Khatri
Ola Krutrim AI
Artificial Intelligence, Multi-Modal AI, Conversational AI, Deep Learning, Machine Learning
Shubham Agarwal
Krutrim AI, Bangalore, India