MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large language models (LLMs) show significant deficiencies in multimodal understanding and reasoning in non-Western, resource-constrained cultural contexts, particularly across Asia, revealing critical gaps in cultural cognition and a reliance on superficial, shortcut-based learning. Method: We introduce the first Asian-culture-focused, multilingual multimodal alignment evaluation framework, covering eight countries, ten languages, and 27,000 multiple-choice questions. It aligns text, image, and speech at the input level and proposes a five-dimensional evaluation protocol with a dedicated cultural-cognition verification module. Leveraging human-annotated data, cross-modal consistency testing, attention tracing, and Vision-ablated Prefix Replay (VPR), a novel visual ablation technique, we systematically diagnose model limitations. Contribution/Results: Our framework establishes a reproducible, culturally grounded benchmark for multimodal LLMs and delivers actionable insights for developing culturally reliable models, directly addressing alignment failures in underrepresented sociocultural settings.

📝 Abstract
Large language models (LLMs) are now used worldwide, yet their multimodal understanding and reasoning often degrade outside Western, high-resource settings. We propose MMA-ASIA, a comprehensive framework to evaluate LLMs' cultural awareness with a focus on Asian contexts. MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned multiple-choice benchmark covering 8 Asian countries and 10 languages, comprising 27,000 questions; over 79 percent require multi-step reasoning grounded in cultural context, moving beyond simple memorization. To our knowledge, this is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech. This enables direct tests of cross-modal transfer. Building on this benchmark, we propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity. To ensure rigorous assessment, a Cultural Awareness Grounding Validation Module detects "shortcut learning" by checking whether the requisite cultural knowledge supports correct answers. Finally, through comparative model analysis, attention tracing, and an innovative Vision-ablated Prefix Replay (VPR) method, we probe why models diverge across languages and modalities, offering actionable insights for building culturally reliable multimodal LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluating how LLMs' cultural awareness degrades in non-Western contexts
Assessing multimodal alignment across text, image, and speech modalities
Measuring cultural knowledge generalization and cross-modal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodally aligned benchmark across text, image, and speech
Five-dimensional evaluation protocol for cultural awareness
Vision-ablated Prefix Replay (VPR) method probes divergence across languages and modalities
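To make the cross-modal consistency dimension concrete, here is a minimal sketch of how agreement across text, image, and speech inputs might be scored on a multiple-choice benchmark. The function name, the per-modality answer dictionaries, and the exact-agreement metric are illustrative assumptions, not the paper's published scoring code.

```python
def cross_modal_consistency(answers_by_modality):
    """Fraction of shared questions on which every modality
    (e.g. text, image, speech) selects the same option.
    Hypothetical metric for illustration only."""
    # Only score questions answered under all modalities.
    qids = set.intersection(*(set(a) for a in answers_by_modality.values()))
    if not qids:
        return 0.0
    consistent = sum(
        1 for q in qids
        if len({answers_by_modality[m][q] for m in answers_by_modality}) == 1
    )
    return consistent / len(qids)

# Toy predictions: q1 and q3 agree across all three modalities, q2 does not.
preds = {
    "text":   {"q1": "A", "q2": "B", "q3": "C"},
    "image":  {"q1": "A", "q2": "C", "q3": "C"},
    "speech": {"q1": "A", "q2": "B", "q3": "C"},
}
print(cross_modal_consistency(preds))  # → 0.6666666666666666
```

A stricter variant could additionally require the agreed answer to be correct, separating "consistently right" from "consistently wrong" behavior.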
Weihua Zheng
A*STAR
Multilingual LLM, Cultural LLM
Zhengyuan Liu
Institute for Infocomm Research (I2R) - A*STAR; IEEE Senior Member
Natural Language Processing, Artificial Intelligence, Human-Centered AI
Tanmoy Chakraborty
Indian Institute of Technology Delhi
Weiwen Xu
The Chinese University of Hong Kong
Natural Language Processing
Xiaoxue Gao
Research Scientist, I2R, A*STAR; National University of Singapore; IEEE Senior Member
Generative AI, Speech, Large Language Models
Bryan Chen Zhengyu Tan
Singapore University of Technology and Design
Bowei Zou
Agency for Science, Technology and Research, Singapore
Chang Liu
Shanghai University of Finance and Economics
Yujia Hu
Singapore University of Technology and Design
Xing Xie
Microsoft Research Asia
Xiaoyuan Yi
Senior Researcher, Microsoft Research Asia
Natural Language Generation, Societal AI, Large Language Model, Responsible AI
Jing Yao
Microsoft Research Asia
Chaojun Wang
Alibaba DAMO Academy
Long Li
Research Staff Member, Inspur Group Co., Ltd.
Software Defined Networking, Network Performance Optimization
Rui Liu
Inner Mongolia University
Huiyao Liu
Inner Mongolia University
Koji Inoue
Kyoto University
Spoken Dialogue System, Human-Robot Interaction, Turn-Taking
Ryuichi Sumida
Kyoto University
Tatsuya Kawahara
Professor, School of Informatics, Kyoto University
Speech Processing, Speech Recognition, Natural Language Processing, Dialogue
Fan Xu
Jiangxi Normal University
Lingyu Ye
Jiangxi Normal University
Wei Tian
Jiangxi Normal University
Dongjun Kim
Stanford University
Machine Learning, Artificial Intelligence
Jimin Jung
Korea University
Jaehyung Seo
Korea University
Natural Language Generation, Commonsense Reasoning, Hallucination, Knowledge Editing