Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation

📅 2026-02-25
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the challenges of machine translation for high-resolution, text-dense images, where cluttered layouts, diverse fonts, and non-textual distractions often lead to missing text and semantic distortion. To tackle these issues, the authors propose GLoTran, a novel framework that introduces a global–local dual visual perception paradigm. By leveraging an instruction-guided alignment strategy, GLoTran effectively fuses low-resolution global scene context with multi-scale local text crops, enabling multimodal large language models to preserve both scene-level coherence and fine-grained textual details. The study also contributes GLoD, a large-scale dataset comprising 510,000 high-resolution image–text pairs. Experimental results demonstrate that the proposed approach significantly outperforms state-of-the-art methods in translation completeness and accuracy, advancing the field of fine-grained text-rich image translation.

📝 Abstract
Text Image Machine Translation (TIMT) aims to translate text embedded in images from a source language into a target language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs), struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details. Moreover, to realize this dual-perception paradigm, we construct GLoD, a large-scale text-rich TIMT dataset comprising 510K high-resolution global-local image-text pairs covering diverse real-world scenarios. Extensive experiments demonstrate that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.
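
The global-local input described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how regions are selected, so a uniform tiling grid stands in for region-level text slices, and every name here (`downscaled_size`, `tile_boxes`, `build_input`, `DualPerceptionInput`), the 448-pixel global size, the tile sizes, and the instruction wording are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in original pixels


@dataclass
class DualPerceptionInput:
    """Hypothetical container for one global view plus local crops."""
    global_view_size: Tuple[int, int]  # size of the downscaled full image
    local_boxes: List[Box]             # crop boxes at several scales
    instruction: str                   # prompt linking both views


def downscaled_size(width: int, height: int, max_side: int = 448) -> Tuple[int, int]:
    """Shrink so the longer side is at most max_side, keeping aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def tile_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Cover the image with non-overlapping square tiles; edge tiles are clipped.

    Assumption: a plain grid replaces whatever text-region selection
    the paper actually uses.
    """
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in range(0, height, tile)
        for x in range(0, width, tile)
    ]


def build_input(width: int, height: int, scales=(1024, 512),
                src: str = "Chinese", tgt: str = "English") -> DualPerceptionInput:
    """Assemble the global view, multi-scale local boxes, and an instruction."""
    boxes: List[Box] = []
    for tile in scales:
        boxes.extend(tile_boxes(width, height, tile))
    # Assumed instruction wording; the paper's actual prompt is not given here.
    instruction = (
        f"Use the first image for scene context. Translate the {src} text "
        f"in the {len(boxes)} cropped regions into {tgt}."
    )
    return DualPerceptionInput(downscaled_size(width, height), boxes, instruction)
```

For a 4096×2048 image, `build_input(4096, 2048)` pairs one downscaled global view with crop boxes at two tile sizes; a real pipeline would crop those boxes from the original image and pass them to the MLLM together with the instruction.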
Problem

Research questions and friction points this paper is trying to address.

Text Image Machine Translation
high-resolution
text-rich images
multimodal large language models
visual perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global-Local Perception
Text Image Machine Translation
Multimodal Large Language Models
High-Resolution Text-Rich Images
Instruction-Guided Alignment