MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation

πŸ“… 2026-03-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing vision-language large models exhibit limited robustness in end-to-end text-image machine translation (TIMT) across diverse real-world scenarios and low-resource languages, and no systematic evaluation benchmark exists for this setting. To address this gap, this work introduces MMTIT-Bench, the first benchmark of 1,400 human-verified images covering 14 non-English, non-Chinese languages across multiple scene types. Building on the benchmark, the authors propose CPR-Trans, a unified cognition-perception-reasoning framework that integrates scene understanding, textual perception, and translation reasoning. Through vision-language model–driven data generation, structured reasoning supervision, chain-of-thought enhancement, and multilingual image-text alignment, CPR-Trans improves translation accuracy and interpretability at both the 3B and 7B model scales.

πŸ“ Abstract
End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench to promote multilingual and multi-scenario TIMT research upon acceptance.
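The cognition → perception → reasoning chain described above can be pictured as a structured chain-of-thought supervision target that a VLLM is trained to emit before its final translation. The sketch below is illustrative only: the tag names, field schema, and example content are assumptions, since the paper does not publish its exact data format.

```python
# Illustrative sketch of a CPR-style supervision record for TIMT.
# Tag names and schema are hypothetical, not the paper's actual format.

def build_cpr_target(cognition: str, perception: str, reasoning: str,
                     translation: str) -> str:
    """Compose a chain-of-thought target that chains scene cognition,
    text perception, and translation reasoning before the translation."""
    return (
        f"<cognition>{cognition}</cognition>\n"
        f"<perception>{perception}</perception>\n"
        f"<reasoning>{reasoning}</reasoning>\n"
        f"<translation>{translation}</translation>"
    )

# Hypothetical example: a German street sign translated into English.
example = build_cpr_target(
    cognition="A street sign photographed outdoors; short imperative text.",
    perception="Detected text: 'Ausfahrt freihalten' (German).",
    reasoning="Imperative traffic notice; 'Ausfahrt' = exit, "
              "'freihalten' = keep clear.",
    translation="Keep the exit clear.",
)
print(example)
```

A record like this makes each intermediate step explicit and inspectable, which is one plausible way the "structured, interpretable supervision" described in the abstract could align perception with reasoning during training.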
Problem

Research questions and friction points this paper is trying to address.

text-image machine translation
multilingual benchmark
low-resource languages
visual scene robustness
end-to-end evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-image machine translation
multilingual benchmark
vision-language reasoning
Cognition-Perception-Reasoning
Chain-of-Thought