PRIM: Towards Practical In-Image Multilingual Machine Translation

๐Ÿ“… 2025-09-05
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Current In-Image Machine Translation (IIMT) research relies heavily on synthetic data and fails to model real-world challenges such as complex backgrounds, diverse fonts, and variable text layouts, leaving a significant gap between academic progress and practical applicability. To address this, we present the first systematic study of Practical In-Image Multilingual Machine Translation (IIMMT) on authentic images. We introduce PRIM, the first high-quality, real-world dataset of captured one-line text images for IIMMT. Furthermore, we propose VisTrans, an end-to-end model that explicitly decouples the semantic text representation from the background visual information. VisTrans integrates an OCR-aware module, a background-aware enhancement mechanism, and multilingual shared representation learning. On PRIM, VisTrans substantially outperforms existing methods in both translation accuracy and generated-image fidelity, while supporting translation among multiple languages. Both the code and the PRIM dataset are publicly released.
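The decoupled design described above can be pictured as a dual-branch network: a shared visual encoder feeds a text decoder (producing target-language tokens) and a background head (reconstructing non-text pixels for rendering). The sketch below is a minimal illustration of that idea in PyTorch; all module names, layer sizes, and the patch-based background head are our assumptions for illustration, not the released VisTrans implementation.

```python
# Minimal sketch of a text/background-decoupled IIMT model.
# Architecture details here are illustrative assumptions only.
import torch
import torch.nn as nn

class DecoupledIIMT(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000):
        super().__init__()
        # Shared visual encoder over embedded image patches.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        # Text branch: decodes target-language tokens from visual features.
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6,
        )
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Background branch: predicts pixels of 16x16 RGB patches.
        self.bg_head = nn.Linear(d_model, 3 * 16 * 16)

    def forward(self, img_patches, tgt_tokens):
        # img_patches: (B, N, d_model) pre-embedded image patches.
        memory = self.encoder(img_patches)
        # Translate: target tokens attend over the visual memory
        # (a causal tgt_mask would be added for real training).
        txt = self.text_decoder(self.tok_embed(tgt_tokens), memory)
        logits = self.lm_head(txt)   # (B, T, vocab_size)
        # Reconstruct the background from the same visual memory.
        bg = self.bg_head(memory)    # (B, N, 3*16*16)
        return logits, bg

model = DecoupledIIMT()
patches = torch.randn(2, 64, 512)           # 2 images, 64 patch embeddings
tokens = torch.randint(0, 32000, (2, 10))   # target-language token ids
logits, bg = model(patches, tokens)
print(logits.shape, bg.shape)                # (2, 10, 32000) (2, 64, 768)
```

Separating the two branches lets the translation loss supervise only the text pathway while a reconstruction loss supervises the background pathway, which is one way to realize the "processed separately" claim in the summary.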

๐Ÿ“ Abstract
In-Image Machine Translation (IIMT) aims to translate images containing texts from one language to another. Current research on end-to-end IIMT is mainly conducted on synthetic data with simple backgrounds, a single font, fixed text positions, and bilingual translation, which cannot fully reflect the real world, causing a significant gap between research and practical conditions. To facilitate research on IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). To address the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex backgrounds, various fonts, and diverse text positions, and supports multilingual translation directions. We propose VisTrans, an end-to-end model that handles the challenges of the practical conditions in PRIM by processing the visual text and background information in the image separately, ensuring multilingual translation capability while improving visual quality. Experimental results indicate that VisTrans achieves better translation quality and visual effects than other models. The code and dataset are available at: https://github.com/BITHLP/PRIM.
Problem

Research questions and friction points this paper is trying to address.

Develops practical in-image multilingual translation for real-world images
Addresses the limitations of synthetic training data, which lacks complex backgrounds
Handles diverse fonts and varied text positions in captured images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-world annotated dataset supporting multilingual translation directions
End-to-end model that processes text and background separately
Improves both translation quality and the visual quality of output images
๐Ÿ”Ž Similar Papers
No similar papers found.
Yanzhi Tian
Beijing Institute of Technology
Machine Translation · Large Language Models · Vision Language Models
Zeming Liu
School of Computer Science and Engineering, Beihang University
Zhengyang Liu
Royal Melbourne Hospital, Parkville, Australia
Ophthalmology · Biostatistics
Chong Feng
School of Computer Science and Technology, Beijing Institute of Technology
Xin Li
School of Computer Science and Technology, Beijing Institute of Technology
Heyan Huang
School of Computer Science and Technology, Beijing Institute of Technology
Yuhang Guo
School of Computer Science and Technology, Beijing Institute of Technology