Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of machine translation for high-resolution, text-dense images, where cluttered layouts, diverse fonts, and non-textual distractions often lead to missing text and semantic distortion. To tackle these issues, the authors propose GLoTran, a novel framework that introduces a global–local dual visual perception paradigm. By leveraging an instruction-guided alignment strategy, GLoTran effectively fuses low-resolution global scene context with multi-scale local text crops, enabling multimodal large language models to preserve both scene-level coherence and fine-grained textual details. The study also contributes GLoD, a large-scale dataset comprising 510,000 high-resolution image–text pairs. Experimental results demonstrate that the proposed approach significantly outperforms state-of-the-art methods in translation completeness and accuracy, advancing the field of fine-grained text-rich image translation.

📝 Abstract
Text Image Machine Translation (TIMT) aims to translate text embedded in images from the source language into the target language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs), struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details. Moreover, to realize this dual-perception paradigm, we construct GLoD, a large-scale text-rich TIMT dataset comprising 510K high-resolution global-local image-text pairs covering diverse real-world scenarios. Extensive experiments demonstrate that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.
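The dual-perception input described in the abstract pairs a low-resolution global view of the full image with multi-scale crops around detected text regions. The sketch below illustrates that input construction only; the function name, the 448-pixel global size, and the context scales are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def build_dual_perception_inputs(image, global_size=448, crop_boxes=None,
                                 scales=(1.0, 2.0)):
    """Sketch of a global-local input pair (names and defaults are assumptions).

    image: H x W x 3 uint8 array.
    crop_boxes: list of (y0, x0, y1, x1) text regions.
    Returns a downsampled global view plus multi-scale local crops.
    """
    h, w = image.shape[:2]
    # Global view: naive stride-based downsampling to roughly global_size pixels
    # per side (a real pipeline would use proper image resampling).
    sy, sx = max(1, h // global_size), max(1, w // global_size)
    global_view = image[::sy, ::sx]
    # Local views: each text region is re-cropped at several context scales,
    # clamped to the image bounds, so fine-grained glyphs stay at full resolution.
    local_views = []
    for (y0, x0, y1, x1) in (crop_boxes or []):
        cy, cx = (y0 + y1) / 2, (x0 + x1) / 2
        for s in scales:
            hh, ww = (y1 - y0) * s / 2, (x1 - x0) * s / 2
            a, b = max(0, int(cy - hh)), min(h, int(cy + hh))
            c, d = max(0, int(cx - ww)), min(w, int(cx + ww))
            local_views.append(image[a:b, c:d])
    return global_view, local_views
```

In a GLoTran-style setup, the global view would preserve scene context while the local crops feed the MLLM full-resolution text detail, with the instruction prompt tying the two together.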
Problem

Research questions and friction points this paper is trying to address.

Text Image Machine Translation · high-resolution · text-rich images · multimodal large language models · visual perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global-Local Perception · Text Image Machine Translation · Multimodal Large Language Models · High-Resolution Text-Rich Images · Instruction-Guided Alignment
Junxin Lu
School of Computer Science and Technology, East China Normal University, Shanghai 200062, China

Tengfei Song
Huawei
Emotion recognition, Computer vision, Graph neural network

Zhanglin Wu
2012 Lab, Huawei Co. LTD
Machine Translation, Natural Language Processing

Pengfei Li
2012 Labs, Huawei Technologies Co., LTD, China

Xiaowei Liang
2012 Labs, Huawei Technologies Co., LTD, China

Hui Yang
2012 Labs, Huawei Technologies Co., LTD, China

Kun Chen
2012 Labs, Huawei Technologies Co., LTD, China

Ning Xie
Huawei Technologies
NLP, AI

Yunfei Lu
Huawei
Large Language Model, Machine Translation, Data Mining

Jing Zhao
Department of Computer and Systems Sciences (DSV), Stockholm University
Machine Learning, Data Mining, Health Informatics

Shiliang Sun
Shanghai Jiao Tong University
Machine Learning, Artificial Intelligence

Daimeng Wei
2012 Labs, Huawei Technologies Co., LTD, China