MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation

πŸ“… 2026-03-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing vision-language large models exhibit limited robustness in end-to-end text-image machine translation (TIMT) across diverse real-world scenarios and low-resource languages, and no systematic evaluation benchmark exists for this setting. To address this gap, this work introduces MMTIT-Bench, the first benchmark of 1,400 human-verified images covering 14 non-English, non-Chinese languages across multiple scene types. Building on the benchmark, the authors propose CPR-Trans, a unified cognition-perception-reasoning framework that integrates scene understanding, textual perception, and translation reasoning. Through vision-language model–driven data generation, structured reasoning supervision, chain-of-thought enhancement, and multilingual image-text alignment, CPR-Trans improves translation accuracy and interpretability at both the 3B and 7B model scales.

πŸ“ Abstract
End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench to promote multilingual and multi-scenario TIMT research upon acceptance.
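The cognition → perception → reasoning chain described above can be pictured as a structured chain-of-thought supervision target that a VLLM is trained to emit before its final translation. The sketch below is illustrative only: the tag names, field schema, and example content are assumptions, since the paper does not publish its exact data format.

```python
# Illustrative sketch of a CPR-style supervision record for TIMT.
# Tag names and schema are hypothetical, not the paper's actual format.

def build_cpr_target(cognition: str, perception: str, reasoning: str,
                     translation: str) -> str:
    """Compose a chain-of-thought target that chains scene cognition,
    text perception, and translation reasoning before the translation."""
    return (
        f"<cognition>{cognition}</cognition>\n"
        f"<perception>{perception}</perception>\n"
        f"<reasoning>{reasoning}</reasoning>\n"
        f"<translation>{translation}</translation>"
    )

# Hypothetical example: a German street sign translated into English.
example = build_cpr_target(
    cognition="A street sign photographed outdoors; short imperative text.",
    perception="Detected text: 'Ausfahrt freihalten' (German).",
    reasoning="Imperative traffic notice; 'Ausfahrt' = exit, "
              "'freihalten' = keep clear.",
    translation="Keep the exit clear.",
)
print(example)
```

A record like this makes each intermediate step explicit and inspectable, which is one plausible way the "structured, interpretable supervision" described in the abstract could align perception with reasoning during training.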
Problem

Research questions and friction points this paper is trying to address.

text-image machine translation
multilingual benchmark
low-resource languages
visual scene robustness
end-to-end evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-image machine translation
multilingual benchmark
vision-language reasoning
Cognition-Perception-Reasoning
Chain-of-Thought