A Culturally-diverse Multilingual Multimodal Video Benchmark&Model

📅 2025-06-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing multilingual video multimodal models heavily rely on English and lack support for low-resource languages and deep cultural semantics (e.g., festivals, rituals, landmarks). To address this, we propose ViMUL—the first large language-video model explicitly designed for linguistic and cultural inclusivity—and introduce ViMUL-Bench: the first open-source, human-verified (8K samples), multilingual multimodal video benchmark covering 14 languages (including 8 low-resource ones) and 15 cultural themes. Methodologically, ViMUL integrates multilingual instruction tuning, cross-lingual vision–text alignment, machine-translation-augmented data construction, and native-language human validation, enabling multi-granularity video understanding. Experiments demonstrate that ViMUL significantly enhances video comprehension for low-resource languages, achieving more balanced performance across all 14 languages. ViMUL-Bench is fully open-sourced, establishing a new standard for culturally grounded, multilingual video understanding research.

Technology Category

Application Category

📝 Abstract
Large multimodal models (LMMs) have recently gained attention due to their effectiveness to understand and generate descriptions of visual content. Most existing LMMs are in English language. While few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we also introduce a machine translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high-and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM along with a large-scale multilingual video training set will help ease future research in developing cultural and linguistic inclusive multilingual video LMMs. Our proposed benchmark, video LMM and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.
Problem

Research questions and friction points this paper is trying to address.

Lack of multilingual video LMMs for cultural inclusivity
Need for benchmark testing across 14 diverse languages
Developing inclusive models for high- and low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Video LMM benchmark ViMUL-Bench
Machine translated multilingual video training set
Simple multilingual video LMM named ViMUL
🔎 Similar Papers
No similar papers found.
B
Bhuiyan Sanjid Shafique
Mohamed bin Zayed University of AI
Ashmal Vayani
Ashmal Vayani
University of Central Florida
Computer VisionMultiModalityLarge Language ModelsResponsible AI
Muhammad Maaz
Muhammad Maaz
PhD Computer Vision at MBZUAI
Computer VisionDeep LearningVision-LanguageGenerative AI
H
H. Rasheed
Mohamed bin Zayed University of AI
Dinura Dissanayake
Dinura Dissanayake
Research Engineer, MBZUAI
Computer VisionReasoning
M
Mohammed Irfan Kurpath
Mohamed bin Zayed University of AI
Y
Yahya Hmaiti
University of Central Florida
G
Go Inoue
Mohamed bin Zayed University of AI
Jean Lahoud
Jean Lahoud
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision
M
Md. Safirur Rashid
Islamic University of Technology
S
Shadid Intisar Quasem
Islamic University of Technology
M
Maheen Fatima
Air University
F
Franco Vidal
University of Central Florida
Mykola Maslych
Mykola Maslych
CS PhD candidate at ISUE Lab, University of Central Florida
Human-Computer Interaction3DUIGestural InterfacesVirtual RealityApplied Machine Learning
Ketan More
Ketan More
MBZUAI
Computer Vision
Sanoojan Baliah
Sanoojan Baliah
Research Associate
Visual GenerationDomain generalizationComputer visionMachine learning
H
Hasindri Watawana
Mohamed bin Zayed University of AI
Y
Yuhao Li
Mohamed bin Zayed University of AI
Fabian Farestam
Fabian Farestam
ETH Zürich
games on graphsllm evaluations
L
Leon Schaller
Technische Universität München
R
Roman Tymtsiv
Independent Researcher
S
Simon Weber
Technische Universität München
Hisham Cholakkal
Hisham Cholakkal
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer VisionLarge Multimodal ModelsLLMHealthcare Foundation ModelConversational Assistant
Ivan Laptev
Ivan Laptev
Professor at MBZUAI, on leave from INRIA
Computer VisionRoboticsAction RecognitionObject Recognition
Shin'ichi Satoh
Shin'ichi Satoh
National Institute of Informatics
Multimedia Content AnalysisComputer VisionArtificial IntelligenceInformation Retrieval
Michael Felsberg
Michael Felsberg
Professor of Computer Vision, Linköping University
Computer VisionMachine LearningRobot Vision
Mubarak Shah
Mubarak Shah
Trustee Chair Professor of Computer Science, University of Central Florida
Computer Vision
S
Salman Khan
Mohamed bin Zayed University of AI, Australian National University
Fahad Shahbaz Khan
Fahad Shahbaz Khan
MBZUAI, Linköping University Sweden
Computer VisionObject RecognitionGenerative AIAI for Science