🤖 AI Summary
Bangla Visual Question Answering (VQA) has long suffered from cultural misalignment and data scarcity, as existing datasets largely rely on translated foreign sources and lack culturally grounded images with semantically aligned annotations. To address this, we introduce ChitroJera—the first large-scale, culturally adapted Bangla VQA benchmark, comprising over 15,000 image–question–answer triplets drawn from authentic local contexts. Methodologically, we propose a lightweight dual-encoder architecture specifically designed for low-resource languages, integrating multimodal features, pretrained vision-language encoders, and LLM-based prompt engineering; cultural relevance is ensured via region-specific data collection and collaborative human-in-the-loop annotation. Experiments demonstrate that our dual-encoder model substantially outperforms comparably sized multimodal baselines, while LLM prompt tuning achieves state-of-the-art performance. The ChitroJera dataset is publicly released, bridging a critical gap in Bangla vision-language understanding.
📝 Abstract
Visual Question Answering (VQA) poses the problem of answering a natural language question about a visual context. Bangla, despite being a widely spoken language, is considered low-resource in the realm of VQA due to the lack of a proper benchmark dataset. The absence of such datasets hinders the evaluation of models known to perform well in other languages. Furthermore, existing Bangla VQA datasets offer little cultural relevance and are largely adapted from their foreign counterparts. To address these challenges, we introduce a large-scale Bangla VQA dataset titled ChitroJera, totaling over 15k samples drawn from diverse and locally relevant data sources. We assess the performance of text encoders, image encoders, multimodal models, and our novel dual-encoder models. The experiments reveal that the pre-trained dual-encoders outperform other models of their scale. We also evaluate the performance of large language models (LLMs) using prompt-based techniques, with LLMs achieving the best performance. Given the underdeveloped state of existing datasets, we envision ChitroJera expanding the scope of vision-language tasks in Bangla.
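The dual-encoder idea mentioned above (separate image and question encoders whose outputs are fused for answer prediction) can be sketched minimally as follows. This is an illustrative assumption, not the paper's actual architecture: the encoder backbones, fusion strategy, dimensions, and classification head here are all hypothetical stand-ins.

```python
import numpy as np

# Hypothetical minimal dual-encoder VQA sketch. The real model would use
# pretrained vision and Bangla text encoders; random projections stand in
# for them here purely for illustration.
rng = np.random.default_rng(0)

D_IMG, D_TXT, D_FUSE, N_ANSWERS = 512, 256, 128, 1000

# Stand-ins for frozen pretrained encoder heads (assumed, not from the paper).
W_img = rng.standard_normal((D_IMG, D_FUSE)) / np.sqrt(D_IMG)
W_txt = rng.standard_normal((D_TXT, D_FUSE)) / np.sqrt(D_TXT)
W_out = rng.standard_normal((2 * D_FUSE, N_ANSWERS)) / np.sqrt(2 * D_FUSE)

def dual_encoder_logits(img_feat, txt_feat):
    """Encode each modality separately, concatenate (late fusion),
    then classify over a fixed answer vocabulary (VQA-as-classification)."""
    z_img = np.tanh(img_feat @ W_img)                 # image branch
    z_txt = np.tanh(txt_feat @ W_txt)                 # question branch
    fused = np.concatenate([z_img, z_txt], axis=-1)   # late fusion
    return fused @ W_out                              # answer logits

# One image-question pair with random features.
img_feat = rng.standard_normal(D_IMG)
txt_feat = rng.standard_normal(D_TXT)
logits = dual_encoder_logits(img_feat, txt_feat)
pred = int(np.argmax(logits))  # predicted index into the answer vocabulary
```

The key design point the sketch captures is that the two modalities are encoded independently and only interact at the fusion step, which keeps the model lightweight relative to jointly trained multimodal transformers.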