Comparative Evaluation of Machine Translation Systems on Images with Text

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study evaluates machine translation of text embedded in images, a task at the intersection of computer vision and natural language processing. It compares three paradigms: modular pipelines that separate text detection, recognition, and translation; multimodal large language models (MLLMs) that process image and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems combine docTR for optical character recognition with multilingual LLMs such as Llama and EuroLLM, and the evaluated MLLMs are different configurations of Gemini 2.5. Translation quality is measured with BLEU, chrF, and TER on parallel multilingual datasets covering multiple language pairs. Modular pipelines outperform the end-to-end approach, and MLLMs achieve the best overall performance, which the authors take as evidence of superior flexibility and contextual understanding.
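The modular paradigm can be pictured as three independent stages chained together. The following is a minimal sketch of that structure only, not the paper's implementation: the class `ModularPipeline`, the `Box` type, and the three toy stage functions are hypothetical stand-ins, and no real docTR, Llama, or EuroLLM call is made.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical bounding box: (x0, y0, x1, y1) in pixel coordinates.
Box = Tuple[int, int, int, int]


@dataclass
class ModularPipeline:
    """Chains three decoupled stages; each one can be swapped independently."""
    detect: Callable[[bytes], List[Box]]      # image -> text regions
    recognize: Callable[[bytes, Box], str]    # image + region -> source text
    translate: Callable[[str], str]           # source text -> target text

    def run(self, image: bytes) -> List[Tuple[Box, str, str]]:
        results = []
        for box in self.detect(image):
            source = self.recognize(image, box)
            results.append((box, source, self.translate(source)))
        return results


# Toy stand-ins (assumptions for illustration, not real models).
def fake_detect(image: bytes) -> List[Box]:
    return [(0, 0, 10, 5)]


def fake_recognize(image: bytes, box: Box) -> str:
    return "hola"


def fake_translate(text: str) -> str:
    return {"hola": "hello"}.get(text, text)


pipeline = ModularPipeline(fake_detect, fake_recognize, fake_translate)
print(pipeline.run(b""))
```

Because the stages only meet at plain data (boxes and strings), an OCR engine or a translation model could in principle be replaced without touching the other two, which is the property that distinguishes this paradigm from an end-to-end model.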

📝 Abstract

This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multi-modal large language models (MLLMs) capable of processing both image and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems employ state-of-the-art OCR (docTR) combined with multilingual LLMs such as Llama and EuroLLM, while the evaluated MLLMs include different configurations of Gemini 2.5. Experiments were conducted on parallel multilingual datasets covering multiple language pairs, with evaluation based on BLEU, chrF, and TER metrics. The results show that modular pipelines outperform the end-to-end approach, while MLLMs achieve the best overall performance, demonstrating superior flexibility and contextual understanding. These findings underscore the effectiveness of multi-modal reasoning for image-to-text translation and provide a solid foundation for future research on integrating visual understanding and language generation in multilingual settings.
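Of the three metrics named above, chrF is the easiest to illustrate: it is an F-score over character n-grams. The sketch below is a simplified, sentence-level version written for illustration, assuming n-gram orders 1 to 6, beta = 2, and whitespace removal; the function names `char_ngrams` and `simple_chrf` are hypothetical, and the scores are not guaranteed to match any standard toolkit or the numbers reported in the paper.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("the cat sat", "the cat sat"))
print(simple_chrf("the cat sat", "the cat sits"))
```

Character-level matching gives partial credit for near-miss words, which is useful when OCR errors change only a few characters of a token.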

Problem

Research questions and friction points this paper is trying to address.

machine translation

text in images

computer vision

natural language processing

multilingual

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal large language models

machine translation on images

modular pipeline