🤖 AI Summary
Existing video multimodal models rely heavily on English and offer little support for low-resource languages or for deep cultural semantics (e.g., festivals, rituals, landmarks). To address this, we propose ViMUL, the first video LMM explicitly designed for linguistic and cultural inclusivity, and introduce ViMUL-Bench, the first open-source, human-verified multilingual multimodal video benchmark (8K samples) covering 14 languages (including 8 low-resource ones) and 15 categories, eight of them culturally focused. Methodologically, ViMUL combines multilingual instruction tuning, cross-lingual vision-text alignment, machine-translation-augmented data construction, and native-language human validation, enabling multi-granularity video understanding. Experiments show that ViMUL substantially improves video comprehension for low-resource languages and achieves more balanced performance across all 14 languages. ViMUL-Bench is fully open-sourced, establishing a new standard for culturally grounded, multilingual video understanding research.
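To make the benchmark's structure concrete, here is a minimal sketch of what a single ViMUL-Bench evaluation record might look like, assuming a flat per-sample schema. All field names (`ViMULBenchSample`, `duration_bucket`, etc.) are hypothetical and may differ from the released data format.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional

# The 14 benchmark languages, as ISO-639-1 codes (a convention we assume here).
Language = Literal[
    "en", "zh", "es", "fr", "de", "hi", "ar",
    "ru", "bn", "ur", "si", "ta", "sv", "ja",
]

@dataclass
class ViMULBenchSample:
    """Hypothetical per-sample record for a ViMUL-Bench-style benchmark."""
    video_path: str                                   # source video clip
    language: Language                                # question/answer language
    category: str                                     # one of 15 categories, e.g. "festivals"
    duration_bucket: Literal["short", "medium", "long"]
    question_type: Literal["mcq", "open_short", "open_long"]
    question: str
    options: Optional[List[str]] = None               # present only for MCQ items
    answer: str = ""                                  # human-verified reference answer
```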
📝 Abstract
Large multimodal models (LMMs) have recently gained attention for their effectiveness in understanding and describing visual content. Most existing LMMs, however, are English-only. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond English for cultural and linguistic inclusivity has yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse ones ranging from lifestyles, festivals, foods, and rituals to local landmarks and prominent cultural personalities. ViMUL-Bench comprises both open-ended (short- and long-form) and multiple-choice questions spanning various video durations (short, medium, and long), with 8K samples manually verified by native speakers. In addition, we introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, which is shown to provide a better trade-off between high- and low-resource languages for video understanding. We hope that ViMUL-Bench and our multilingual video LMM, along with the large-scale multilingual video training set, will ease future research on developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM, and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.
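As an illustration of how the high- versus low-resource trade-off mentioned above could be measured on a benchmark of this shape, the sketch below aggregates multiple-choice accuracy per language. The `model.answer_mcq` interface and the sample keys are assumptions for illustration, not the released ViMUL evaluation harness.

```python
from collections import defaultdict

def mcq_accuracy_by_language(model, samples):
    """Per-language multiple-choice accuracy on a ViMUL-Bench-style split.

    `model` is assumed (hypothetically) to expose
    answer_mcq(video_path, question, options, language) -> str.
    Each sample is a dict carrying the keys used below.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        if s["question_type"] != "mcq":
            continue  # open-ended items need judge-based scoring instead
        pred = model.answer_mcq(
            s["video_path"], s["question"], s["options"], s["language"]
        )
        total[s["language"]] += 1
        correct[s["language"]] += int(pred.strip() == s["answer"].strip())
    return {lang: correct[lang] / total[lang] for lang in total}
```

Reporting the resulting per-language scores side by side, rather than a single pooled accuracy, is what makes the gap between high- and low-resource languages visible.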