🤖 AI Summary
Document Image Machine Translation (DIMT) suffers from insufficient training data and complex image-text modality coupling, leading to poor generalization. To address this, M4Doc is proposed: a single-to-mix modality alignment framework that aligns a lightweight image-only encoder with the multimodal representations of a Multimodal Large Language Model (MLLM) pre-trained on large-scale document image data. During training, this alignment injects the MLLM's joint image-text knowledge into the DIMT model, letting it learn crucial visual-textual correlations. At inference, only the lightweight image encoder is required and the MLLM is bypassed entirely, ensuring efficiency and practical deployability. Experiments demonstrate substantial improvements in translation quality across multiple benchmarks, with particularly strong generalization to out-of-domain scenarios and documents with complex layouts.
📝 Abstract
Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM that has been pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.
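The single-to-mix alignment idea can be sketched as knowledge distillation: during training, features from a lightweight image-only encoder (the "student") are pulled toward the frozen MLLM's multimodal features (the "teacher") by an alignment loss, and at inference only the student runs. This is a minimal illustrative sketch, not the paper's actual objective; the cosine-distance loss and all names here are assumptions.

```python
import math

def cosine_alignment_loss(student_feat, teacher_feat):
    """1 - cosine similarity between the image-only (student) features
    and the MLLM multimodal (teacher) features.
    Cosine distance is an illustrative choice, not the paper's loss."""
    dot = sum(s * t for s, t in zip(student_feat, teacher_feat))
    ns = math.sqrt(sum(s * s for s in student_feat))
    nt = math.sqrt(sum(t * t for t in teacher_feat))
    return 1.0 - dot / (ns * nt)

# Hypothetical features for one document image:
teacher = [0.2, 0.9, 0.1]  # from the frozen, pre-trained MLLM (training only)
student = [0.1, 0.8, 0.3]  # from the lightweight image encoder

# Training: minimize this loss w.r.t. the student encoder's parameters,
# alongside the usual translation loss; the MLLM stays frozen.
loss = cosine_alignment_loss(student, teacher)

# Inference: the MLLM is bypassed entirely; the translation model
# consumes only the (now-aligned) student features.
```

The payoff of this setup is that the multimodal knowledge is paid for once at training time, while deployment costs only the image encoder's forward pass.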