ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla

📅 2024-10-19
🏛️ arXiv.org
🤖 AI Summary
Bangla Visual Question Answering (VQA) has long suffered from cultural misalignment and data scarcity: existing datasets largely rely on translated foreign sources and lack culturally grounded images with semantically aligned annotations. To address this, we introduce ChitroJera, a large-scale, culturally grounded Bangla VQA benchmark comprising over 15,000 image–question–answer triplets drawn from diverse, locally relevant sources, with cultural relevance ensured through region-specific data collection and human-in-the-loop annotation. Methodologically, we benchmark text encoders, image encoders, multimodal models, and a novel lightweight dual-encoder architecture suited to low-resource settings. Experiments show that the pretrained dual-encoder substantially outperforms comparably sized multimodal baselines, while prompt-based large language models (LLMs) achieve the best overall performance. ChitroJera is publicly released, bridging a critical gap in Bangla vision-language understanding.

📝 Abstract
Visual Question Answering (VQA) poses the problem of answering a natural language question about a visual context. Bangla, despite being a widely spoken language, is considered low-resource in the realm of VQA due to the lack of a proper benchmark dataset. The absence of such datasets challenges models that are known to be performant in other languages. Furthermore, existing Bangla VQA datasets offer little cultural relevance and are largely adapted from their foreign counterparts. To address these challenges, we introduce a large-scale Bangla VQA dataset titled ChitroJera, totaling over 15k samples drawn from diverse and locally relevant data sources. We assess the performance of text encoders, image encoders, multimodal models, and our novel dual-encoder models. The experiments reveal that the pre-trained dual-encoders outperform other models of their scale. We also evaluate the performance of large language models (LLMs) using prompt-based techniques, with LLMs achieving the best performance. Given the underdeveloped state of existing datasets, we envision ChitroJera expanding the scope of vision-language tasks in Bangla.
Problem

Research questions and friction points this paper is trying to address.

Lack of regionally relevant Bangla VQA datasets
Underperformance of models in low-resource Bangla VQA
Need for scalable multimodal benchmarks in Bangla
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale Bangla VQA dataset ChitroJera
Novel dual-encoder models outperform others
Prompt-based evaluation of LLMs
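The dual-encoder idea above can be sketched in a few lines: a text encoder and an image encoder map their inputs into a shared space, and the fused representation is scored against a fixed answer vocabulary. This is a minimal toy sketch with random weights, not the paper's actual model; the embedding tables, dimensions, and fusion-by-product choice are all illustrative assumptions (a real system would use pretrained Bangla text and vision encoders).

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
TOKEN_EMB = rng.standard_normal((100, DIM))  # toy embedding table for a 100-token vocab
W_IMG = rng.standard_normal((64, DIM))       # toy projection for 64-d image features
W_ANS = rng.standard_normal((DIM, 5))        # toy scorer over 5 candidate answers

def encode_text(token_ids):
    # Mean-pool token embeddings (stand-in for a pretrained text encoder).
    return TOKEN_EMB[np.asarray(token_ids)].mean(axis=0)

def encode_image(features):
    # Project raw image features into the shared DIM-dimensional space.
    return np.asarray(features) @ W_IMG

def answer_scores(token_ids, image_features):
    # Fuse the two encodings elementwise, then score the answer candidates.
    fused = encode_text(token_ids) * encode_image(image_features)
    return fused @ W_ANS

scores = answer_scores([3, 17, 42], rng.standard_normal(64))
pred = int(np.argmax(scores))  # index of the highest-scoring candidate answer
```

Treating VQA as classification over a closed answer set, as sketched here, is the common setup for benchmarks of this scale; the prompt-based LLM evaluation in the paper instead generates answers directly.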
Deeparghya Dutta Barua
Research and Development, Penta Global Limited, Bangladesh
Md Sakib Ul Rahman Sourove
Research and Development, Penta Global Limited, Bangladesh
Md Farhan Ishmam
Ph.D. Student, University of Utah
Visual Question Answering · Multimodal Learning · Low-Resource NLP
Fabiha Haider
Research and Development, Penta Global Limited, Bangladesh
Fariha Tanjim Shifat
Research and Development, Penta Global Limited, Bangladesh
Md Fahim
Research and Development, Penta Global Limited, Bangladesh; CCDSLab, Independent University, Bangladesh
Md Farhad Alam
Research and Development, Penta Global Limited, Bangladesh