🤖 AI Summary
This work addresses the critical gap in multimodal cultural understanding of existing vision-language models regarding the linguistic diversity of Bengali. Relying solely on Standard Bengali for evaluation substantially overestimates model performance. To remedy this, we introduce BanglaVerse—the first multimodal benchmark that systematically integrates four related languages and five Bengali dialects across nine cultural domains. Constructed through expert-curated images, multilingual and multidialectal annotations, and tasks including visual question answering and image captioning, BanglaVerse establishes a culturally aware evaluation framework. Experiments reveal significant performance degradation—particularly in generation tasks—when models encounter dialectal variations, highlighting the acute bottleneck posed by insufficient cultural knowledge. BanglaVerse thus provides a more realistic and fine-grained platform for assessing the cultural comprehension capabilities of multilingual multimodal models.
📝 Abstract
Bangla culture is richly expressed through region, dialect, history, food, politics, media, and everyday visual life, yet it remains underrepresented in multimodal evaluation. To address this gap, we introduce BanglaVerse, a culturally grounded benchmark for evaluating multilingual vision-language models (VLMs) on Bengali culture across historically linked languages and regional dialects. Built from 1,152 manually curated images across nine domains, the benchmark supports visual question answering and captioning, and is expanded into four languages and five Bangla dialects, yielding ~32.3K artifacts. Our experiments show that evaluating only standard Bangla overestimates true model capability: performance drops under dialectal variation, especially for caption generation, while historically linked languages such as Hindi and Urdu retain some cultural meaning but remain weaker for structured reasoning. Across domains, the main bottleneck is missing cultural knowledge rather than visual grounding alone, with knowledge-intensive categories. These findings position BanglaVerse as a more realistic test bed for measuring culturally grounded multimodal understanding under linguistic variation.