PRIM: Towards Practical In-Image Multilingual Machine Translation

๐Ÿ“… 2025-09-05
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Current In-Image Machine Translation (IIMT) research relies heavily on synthetic data and fails to model real-world challenges such as complex backgrounds, diverse fonts, and variable text layouts, leaving a significant gap between academic progress and practical applicability. To address this, we present the first systematic study of Practical In-Image Multilingual Machine Translation (IIMMT) on authentic images. We introduce PRIM, the first high-quality, real-world dataset of captured one-line text images for IIMMT. Furthermore, we propose VisTrans, an end-to-end model that explicitly decouples the semantic text representation from the background visual information. VisTrans integrates an OCR-aware module, a background-aware enhancement mechanism, and multilingual shared representation learning. On PRIM, VisTrans substantially outperforms existing methods in both translation accuracy and generated-image fidelity, while supporting translation among multiple languages. Both the code and the PRIM dataset are publicly released.
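The decoupled design described above can be pictured as a dual-branch network: a shared visual encoder feeds a text decoder (producing target-language tokens) and a background head (reconstructing non-text pixels for rendering). The sketch below is a minimal illustration of that idea in PyTorch; all module names, layer sizes, and the patch-based background head are our assumptions for illustration, not the released VisTrans implementation.

```python
# Minimal sketch of a text/background-decoupled IIMT model.
# Architecture details here are illustrative assumptions only.
import torch
import torch.nn as nn

class DecoupledIIMT(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000):
        super().__init__()
        # Shared visual encoder over embedded image patches.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        # Text branch: decodes target-language tokens from visual features.
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Background branch: predicts pixels of 16x16 RGB patches.
        self.bg_head = nn.Linear(d_model, 3 * 16 * 16)

    def forward(self, img_patches, tgt_tokens):
        # img_patches: (B, N, d_model) pre-embedded image patches.
        memory = self.encoder(img_patches)
        # Translate: target tokens attend over the visual memory
        # (a causal tgt_mask would be added for real training).
        txt = self.text_decoder(self.tok_embed(tgt_tokens), memory)
        logits = self.lm_head(txt)   # (B, T, vocab_size)
        # Reconstruct the background from the same visual memory.
        bg = self.bg_head(memory)    # (B, N, 3*16*16)
        return logits, bg

model = DecoupledIIMT()
patches = torch.randn(2, 64, 512)           # 2 images, 64 patch embeddings
tokens = torch.randint(0, 32000, (2, 10))   # target-language token ids
logits, bg = model(patches, tokens)
print(logits.shape, bg.shape)                # (2, 10, 32000) (2, 64, 768)
```

Separating the two branches lets the translation loss supervise only the text pathway while a reconstruction loss supervises the background pathway, which is one way to realize the "processed separately" claim in the summary.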

๐Ÿ“ Abstract
In-Image Machine Translation (IIMT) aims to translate images containing texts from one language to another. Current research on end-to-end IIMT is mainly conducted on synthetic data with simple backgrounds, a single font, fixed text positions, and bilingual translation, which cannot fully reflect the real world, causing a significant gap between research and practical conditions. To facilitate research on IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). To address the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex backgrounds, various fonts, and diverse text positions, and supports multilingual translation directions. We propose VisTrans, an end-to-end model that handles the challenges of the practical conditions in PRIM by processing the visual text and background information in the image separately, ensuring multilingual translation capability while improving visual quality. Experimental results indicate that VisTrans achieves better translation quality and visual effects than other models. The code and dataset are available at: https://github.com/BITHLP/PRIM.
Problem

Research questions and friction points this paper is trying to address.

Develops practical in-image multilingual translation for real-world images
Addresses the limitations of synthetic training data, which lacks complex backgrounds
Handles diverse fonts and varied text positions in captured images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world annotated dataset supporting multilingual translation directions
End-to-end model that processes text and background separately
Improves both translation quality and the visual quality of output images
๐Ÿ”Ž Similar Papers
No similar papers found.
Yanzhi Tian
Beijing Institute of Technology
Machine Translation · Large Language Models · Vision Language Models
Zeming Liu
School of Computer Science and Engineering, Beihang University
Zhengyang Liu
Royal Melbourne Hospital, Parkville, Australia
Ophthalmology · Biostatistics
Chong Feng
School of Computer Science and Technology, Beijing Institute of Technology
Xin Li
School of Computer Science and Technology, Beijing Institute of Technology
Heyan Huang
School of Computer Science and Technology, Beijing Institute of Technology
Yuhang Guo
School of Computer Science and Technology, Beijing Institute of Technology