CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation

📅 2025-05-30
🤖 AI Summary
Cultural disparities impede machine translation (MT) performance on culture-specific items (CSIs), ambiguity resolution, and gender agreement, because textual input often lacks sufficient contextual cues. To address this, we propose a multimodal translation paradigm that uses images as cultural context. We introduce CaMMT, the first culturally aware multimodal MT benchmark, comprising over 5,800 English-to-regional-language image–text–translation triplets. We systematically define and evaluate the added value of the visual modality for cross-cultural translation, proposing a culture-sensitive evaluation framework covering CSIs, ambiguity, and gender. Using five vision-language models, we conduct controlled ablation studies comparing joint image-text translation against text-only baselines. Results demonstrate that visual context significantly enhances translation quality: human evaluations show an average 12.3% improvement in CSI handling, alongside measurable gains in ambiguity resolution and gender accuracy.

📝 Abstract
Cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender usage. By releasing CaMMT, we aim to support broader efforts in building and evaluating multimodal translation systems that are better aligned with cultural nuance and regional variation.
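The text-only versus text+image comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the `Triple` fields, the `Translator` callable signature, the toy model, the sample item, and the image path are all hypothetical, and `char_ngram_f1` is a simplified character n-gram F1 used as a stand-in for whatever automatic metrics the paper actually reports.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Triple:
    """One benchmark item: image, English caption, regional-language reference."""
    image_path: str   # hypothetical field name
    english: str
    reference: str


def char_ngram_f1(hypothesis: str, reference: str, n: int = 3) -> float:
    """Simplified character n-gram F1 (a stand-in, not the official chrF)."""
    def ngrams(text: str) -> Counter:
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    hyp, ref = ngrams(hypothesis), ngrams(reference)
    overlap = sum((hyp & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Assumed model interface: (english_text, image_path or None) -> translation.
Translator = Callable[[str, Optional[str]], str]


def compare_settings(triples: list[Triple], translate: Translator) -> dict[str, float]:
    """Average score per setting, plus the mean per-item gain from adding the image."""
    if not triples:
        raise ValueError("need at least one triple")
    text_only = [char_ngram_f1(translate(t.english, None), t.reference) for t in triples]
    with_image = [char_ngram_f1(translate(t.english, t.image_path), t.reference) for t in triples]
    count = len(triples)
    return {
        "text_only": sum(text_only) / count,
        "text_image": sum(with_image) / count,
        "mean_gain": sum(w - t for w, t in zip(with_image, text_only)) / count,
    }


def toy_translate(english: str, image_path: Optional[str]) -> str:
    """Toy stand-in for a VLM: it only produces the full term when given the image."""
    return "tortilla española" if image_path else "tortilla"


if __name__ == "__main__":
    # Made-up example item, not taken from CaMMT.
    data = [Triple("img/0001.jpg", "Spanish omelette", "tortilla española")]
    print(compare_settings(data, toy_translate))
```

Scoring the same items in both settings and averaging the per-item difference keeps the comparison paired, so any gain is attributable to the image rather than to a different sample of captions. A real run would swap `toy_translate` for calls to an actual VLM and would likely pair the automatic score with human judgments on CSIs, disambiguation, and gender, as the abstract describes.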
Problem

Research questions and friction points this paper is trying to address.

Challenges in translating cultural content due to conceptual differences
Investigating images as cultural context for multimodal translation
Evaluating VLMs on culturally specific items and disambiguation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Images as cultural context in translation
Human-curated multimodal benchmark dataset
Vision Language Models for cultural nuance
Emilio Villa-Cueva
Core Authors (MBZUAI)
Sholpan Bolatzhanova
Core Authors (MBZUAI)
Diana Turmakhan
MBZUAI
Low-Resource NLP, Multi-modal NLP
Kareem Elzeky
Core Authors (MBZUAI)
Henok Biadglign Ademtew
Researcher
Deep Learning, Multimodal, NLP
Alham Fikri Aji
MBZUAI, Monash Indonesia
Multilinguality, Low-resource NLP, Language Modeling, Machine Translation
Israel Abebe Azime
Saarland University
NLP | Multimodal learning | Deep Learning Applications
Jinheon Baek
Ph.D. student, KAIST
Machine Learning, Natural Language Processing, RAG
Frederico Belcavello
Federal University of Juiz de Fora | FrameNet Brasil Computational Linguistics Lab
linguistics, communications, frame semantics, TV
Fermin Cristobal
Jan Christian Blaise Cruz
MBZUAI, McGill University, Mila - Quebec AI Institute
Natural Language Processing, Translation, Multilinguality, Low-resource Languages, Code Switching
Mary Dabre
Raj Dabre
Researcher@NICT (Japan), Adjunct Faculty@IIT Madras/AI4Bharat (India)
Artificial Intelligence, Machine Translation, Natural Language Processing, Genetics
Toqeer Ehsan
Teknologian tutkimuskeskus VTT Oy
Natural Language Processing, Deep Learning, Artificial Intelligence
Naome A. Etori
Department of Computer Science and Engineering, University of Minnesota-Twin Cities
AI, NLP, Healthcare, HCI, Computational Social Science
Fauzan Farooqui
Jiahui Geng
Mohamed bin Zayed University of Artificial Intelligence
Artificial Intelligence, Natural Language Processing
Guido Ivetta
Universidad Nacional de Córdoba, Argentina / Fundación Vía Libre
Calibration, Bias in LLMs
Thanmay Jayakumar
Soyeong Jeong
Ph.D. student at KAIST
NLP, RAG, IR, LLMs
Zheng Wei Lim
The University of Melbourne
Natural Language Processing, Machine Translation, Multilinguality
Aishik Mandal
Doctoral Student, Technical University of Darmstadt
Machine Learning, Natural Language Processing, Dialogue Systems, Multimodality, Privacy
Sofía Martinelli
Mihail Minkov Mihaylov
Daniil Orel
MBZUAI
Astrophysics, low-resource NLP, IoT
Aniket Pramanick
Sukannya Purkayastha
Technische Universität Darmstadt
Natural Language Processing, Deep Learning, Machine Learning
Israfel Salazar
Haiyue Song
T. Torrent
D. Yadeta
Injy Hamed
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Code-switching, NLP, Speech Recognition, Machine Translation
A. Tonja
Core Authors (MBZUAI)
T. Solorio
Core Authors (MBZUAI)