🤖 AI Summary
Current large-scale multimodal models perform poorly on culturally dependent everyday visual question answering (VQA), particularly in low-resource languages. To address this, we propose EverydayMMQA, the first framework to explicitly integrate cultural background knowledge into multilingual spoken and visual question answering. We introduce OASIS, a new benchmark dataset covering 18 countries and English and Arabic varieties, comprising 920K images, 14.8M text-based QA pairs, and 3.7M spoken questions. EverydayMMQA defines the spoken and visual question answering (SVQA) task, jointly modeling speech, image, and text to evaluate commonsense, pragmatic, and culturally aware reasoning. Evaluation of eight state-of-the-art models reveals substantial bottlenecks in culturally grounded inference, highlighting critical gaps in cross-cultural multimodal understanding. Our work establishes a new benchmark and opens research directions for culturally aware multimodal AI.
📝 Abstract
Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With over 920K images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Covering English and Arabic varieties across 18 countries, the dataset is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition, requiring pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. Together, EverydayMMQA and OASIS provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.
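As a rough illustration of the four input combinations described above, the sketch below shows how a single OASIS-style example might be expanded into speech-only, text-only, speech+image, and text+image inputs. The field names and helper function are hypothetical, not the paper's actual schema:

```python
# Hypothetical sketch: routing one OASIS-style example into the four
# input combinations (speech-only, text-only, speech+image, text+image).
# Field names below are assumptions for illustration, not the released schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class OasisExample:
    question_text: str                # text form of the question
    question_audio: Optional[bytes]   # spoken form (e.g., WAV bytes), if recorded
    image: Optional[bytes]            # associated image, if any
    answer: str


def input_combinations(ex: OasisExample) -> dict:
    """Enumerate the modality combinations available for one example."""
    combos = {
        "text_only": {"text": ex.question_text},
    }
    if ex.image is not None:
        combos["text_image"] = {"text": ex.question_text, "image": ex.image}
    if ex.question_audio is not None:
        combos["speech_only"] = {"audio": ex.question_audio}
        if ex.image is not None:
            combos["speech_image"] = {"audio": ex.question_audio, "image": ex.image}
    return combos
```

Evaluating the same question under each combination makes it possible to separate a model's speech-understanding gap from its visual-grounding gap, which is the kind of comparison the four-way split enables.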