🤖 AI Summary
Cultural disparities impede machine translation (MT) of culture-specific items (CSIs), ambiguity resolution, and gender agreement, since textual input alone often lacks sufficient contextual cues. To address this, we propose a multimodal translation paradigm that leverages images as cultural context. We introduce CaMMT, the first culture-aware multimodal MT benchmark, comprising over 5,800 triples that pair an image with parallel captions in English and a regional language. We propose a culture-sensitive evaluation framework covering CSIs, ambiguity, and gender, and conduct controlled ablation studies with five vision-language models, comparing joint image+text translation against text-only baselines. Results show that visual context improves translation quality: human evaluations report an average 12.3% improvement in CSI handling, alongside measurable gains in ambiguity resolution and gender accuracy.
📝 Abstract
Cultural content poses challenges for machine translation systems due to differences in conceptualization between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples, each pairing an image with parallel captions in English and a regional language. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender usage. By releasing CaMMT, we aim to support broader efforts in building and evaluating multimodal translation systems that are better aligned with cultural nuance and regional variation.
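The text-only versus text+image ablation described above can be sketched as two prompt conditions sent to the same VLM. The message schema below mirrors common VLM chat APIs; the field names, example caption, and image path are illustrative assumptions, not the paper's actual setup:

```python
# Sketch of the two ablation conditions: a text-only baseline and an
# image-grounded condition in which the image supplies cultural context.
# The dict-based message format is illustrative, not CaMMT's actual code.

def build_prompt(source_caption, target_language, image_path=None):
    """Build a translation request, optionally attaching an image as context."""
    content = []
    if image_path is not None:
        # Text+image condition: attach the image before the instruction.
        content.append({"type": "image", "path": image_path})
    content.append({
        "type": "text",
        "text": f"Translate the following caption into {target_language}: "
                f"{source_caption}",
    })
    return [{"role": "user", "content": content}]

# One benchmark triple evaluated under both conditions.
text_only = build_prompt("A plate of tamales", "Spanish")
with_image = build_prompt("A plate of tamales", "Spanish",
                          image_path="tamales.jpg")
```

Holding the instruction fixed and varying only the presence of the image isolates the contribution of visual context to translation quality.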