🤖 AI Summary
Current large-scale multimodal models perform poorly on culturally dependent everyday visual question answering (VQA), particularly in low-resource languages. To address this, we propose EverydayMMQA, the first framework to explicitly integrate cultural background knowledge into multilingual spoken and visual question answering. We introduce OASIS, a new benchmark dataset covering 18 countries and English and Arabic varieties, comprising 920K images, 14.8M text-based QA pairs, and 3.7M spoken questions. EverydayMMQA defines the spoken and visual question answering (SVQA) task, jointly modeling speech, image, and text to evaluate commonsense, pragmatic, and culturally aware reasoning. Evaluation of eight state-of-the-art models reveals substantial bottlenecks in culturally grounded inference, highlighting critical gaps in cross-cultural multimodal understanding. Our work establishes a new benchmark and opens research directions for culturally aware multimodal AI.
📝 Abstract
Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With over 920K images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Covering English and Arabic varieties across 18 countries, the dataset is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition, requiring pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. Together, EverydayMMQA and OASIS provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.
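As a rough illustration of the four input combinations described above, the sketch below shows how a single OASIS-style example might be expanded into speech-only, text-only, speech+image, and text+image inputs. The field names and helper function are hypothetical, not the paper's actual schema:

```python
# Hypothetical sketch: routing one OASIS-style example into the four
# input combinations (speech-only, text-only, speech+image, text+image).
# Field names below are assumptions for illustration, not the released schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class OasisExample:
    question_text: str                # text form of the question
    question_audio: Optional[bytes]   # spoken form (e.g., WAV bytes), if recorded
    image: Optional[bytes]            # associated image, if any
    answer: str


def input_combinations(ex: OasisExample) -> dict:
    """Enumerate the modality combinations available for one example."""
    combos = {
        "text_only": {"text": ex.question_text},
    }
    if ex.image is not None:
        combos["text_image"] = {"text": ex.question_text, "image": ex.image}
    if ex.question_audio is not None:
        combos["speech_only"] = {"audio": ex.question_audio}
        if ex.image is not None:
            combos["speech_image"] = {"audio": ex.question_audio, "image": ex.image}
    return combos
```

Evaluating the same question under each combination makes it possible to separate a model's speech-understanding gap from its visual-grounding gap, which is the kind of comparison the four-way split enables.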