MELLA: Bridging Linguistic Capability and Cultural Groundedness for Low-Resource Language MLLMs

📅 2025-08-07
🤖 AI Summary
To address the substantial performance degradation of multimodal large language models (MLLMs) on low-resource languages, this paper proposes a dual enhancement paradigm that integrates linguistic capability and cultural awareness. Methodologically, the authors construct the first high-quality multilingual multimodal dataset that preserves native cultural context while ensuring multimodal alignment, fusing web-native alt-text with MLLM-generated image captions and jointly optimizing for multimodal alignment and instruction tuning. The key innovation is the "thick description" objective, realized via a dual-source data strategy and a dual-objective training framework that simultaneously strengthens linguistic competence and cultural perception. Experiments across eight low-resource languages demonstrate consistent and significant improvements on multiple MLLM backbones, yielding cross-modal responses with greater cultural depth and semantic richness. Ablation studies confirm that these gains stem from the synergy of the language and culture enhancement mechanisms.

📝 Abstract
Multimodal Large Language Models (MLLMs) have shown remarkable performance in high-resource languages. However, their effectiveness diminishes significantly in low-resource language contexts. Current multilingual enhancement methods are often limited to the text modality or rely solely on machine translation. While such approaches help models acquire basic linguistic capabilities and produce "thin descriptions", they neglect the importance of multimodal informativeness and cultural groundedness, both of which are crucial for serving low-resource language users effectively. To bridge this gap, in this study, we identify two significant objectives for a truly effective MLLM in low-resource language settings, namely 1) linguistic capability and 2) cultural groundedness, placing special emphasis on cultural awareness. To achieve these dual objectives, we propose a dual-source strategy that guides the collection of data tailored to each goal, sourcing native web alt-text for culture and MLLM-generated captions for linguistics. As a concrete implementation, we introduce MELLA, a multimodal, multilingual dataset. Experiment results show that after fine-tuning on MELLA, there is a general performance improvement across the eight languages on various MLLM backbones, with models producing "thick descriptions". We verify that the performance gains come from both cultural knowledge enhancement and linguistic capability enhancement. Our dataset can be found at https://opendatalab.com/applyMultilingualCorpus.
Problem

Research questions and friction points this paper is trying to address.

Enhancing the linguistic capability of MLLMs in low-resource languages
Improving the cultural groundedness of MLLMs in low-resource languages
Addressing the scarcity of multimodal data for low-resource languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-source strategy for data collection
Native web alt-text for cultural awareness
MLLM-generated captions for linguistic capability
Yufei Gao
Zhengzhou University
Machine Learning · Medical Image Analysis

Jiaying Fei
Shanghai Artificial Intelligence Laboratory

Nuo Chen
The Chinese University of Hong Kong, Shenzhen

Ruirui Chen
Institute of High Performance Computing, A*STAR

Guohang Yan
Shanghai Artificial Intelligence Laboratory

Yunshi Lan
East China Normal University

Botian Shi
Shanghai Artificial Intelligence Laboratory
VLMs · Document Understanding · Autonomous Driving