🤖 AI Summary
This study addresses machine translation of text embedded in images, a task that bridges computer vision and natural language processing. It presents a systematic comparison of three paradigms: modular pipelines that decouple text detection, recognition, and translation; multimodal large language models (MLLMs) that process image and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems use docTR for optical character recognition together with multilingual language models such as Llama and EuroLLM, while the MLLMs are represented by several configurations of Gemini 2.5. Translation quality is assessed on parallel multilingual datasets using BLEU, chrF, and TER. The results show that modular pipelines outperform the end-to-end approach, and that MLLMs achieve the best overall performance, which the authors associate with their greater flexibility and contextual understanding.
📝 Abstract
This work presents a comparative evaluation of machine translation systems applied to images containing text, a task at the intersection of computer vision and natural language processing. The study compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multimodal large language models (MLLMs) that process images and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems combine state-of-the-art OCR (docTR) with multilingual LLMs such as Llama and EuroLLM, while the evaluated MLLMs comprise different configurations of Gemini 2.5. Experiments were conducted on parallel multilingual datasets covering multiple language pairs, with evaluation based on the BLEU, chrF, and TER metrics. The results show that modular pipelines outperform the end-to-end approach, while MLLMs achieve the best overall performance, demonstrating superior flexibility and contextual understanding. These findings underscore the effectiveness of multimodal reasoning for translating text in images and provide a foundation for future research on integrating visual understanding and language generation in multilingual settings.
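To make one of the reported metrics concrete, the following is a minimal sketch of a chrF-style score: a character n-gram F-score that weights recall more heavily than precision. It is an illustration only. The function names `char_ngrams` and `simple_chrf` are hypothetical, the defaults (n-gram orders 1 to 6, beta = 2) are the commonly used chrF settings rather than values stated in the abstract, and this simplified version averages precision and recall across orders, so its numbers may differ from those of the scoring tool used in the study.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF in [0, 1] (illustrative, not official)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # order n is longer than one of the strings
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p == 0.0 or r == 0.0:
        return 0.0
    # F-beta: beta > 1 gives recall more weight than precision.
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Hypothetical system output and reference translation.
    print(simple_chrf("the cat sat on the mat", "the cat sat on a mat"))
```

A score of 1.0 means the hypothesis and reference share all character n-grams, and 0.0 means they share none. Because the score works on characters rather than words, it gives partial credit for near-miss word forms, which is relevant when OCR errors propagate into the translation stage of a modular pipeline.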