🤖 AI Summary
Document Image Machine Translation (DIMT) suffers from insufficient training data and complex image-text modality coupling, leading to poor generalization. To address this, M4Doc is proposed: a single-to-mix modality alignment framework that aligns a lightweight image-only encoder with the multimodal representations of a Multimodal Large Language Model (MLLM) pre-trained on large-scale document image data. During training, this alignment injects the MLLM's joint image-text knowledge into the DIMT model, letting it learn crucial visual-textual correlations. At inference, only the lightweight image encoder is required and the MLLM is bypassed entirely, ensuring efficiency and practical deployability. Experiments demonstrate substantial improvements in translation quality across multiple benchmarks, with particularly strong generalization to out-of-domain scenarios and documents with complex layouts.
📝 Abstract
Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM that has been pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.
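The single-to-mix alignment idea can be sketched as knowledge distillation: during training, features from a lightweight image-only encoder (the "student") are pulled toward the frozen MLLM's multimodal features (the "teacher") by an alignment loss, and at inference only the student runs. This is a minimal illustrative sketch, not the paper's actual objective; the cosine-distance loss and all names here are assumptions.

```python
import math

def cosine_alignment_loss(student_feat, teacher_feat):
    """1 - cosine similarity between the image-only (student) features
    and the MLLM multimodal (teacher) features.
    Cosine distance is an illustrative choice, not the paper's loss."""
    dot = sum(s * t for s, t in zip(student_feat, teacher_feat))
    ns = math.sqrt(sum(s * s for s in student_feat))
    nt = math.sqrt(sum(t * t for t in teacher_feat))
    return 1.0 - dot / (ns * nt)

# Hypothetical features for one document image:
teacher = [0.2, 0.9, 0.1]  # from the frozen, pre-trained MLLM (training only)
student = [0.1, 0.8, 0.3]  # from the lightweight image encoder

# Training: minimize this loss w.r.t. the student encoder's parameters,
# alongside the usual translation loss; the MLLM stays frozen.
loss = cosine_alignment_loss(student, teacher)

# Inference: the MLLM is bypassed entirely; the translation model
# consumes only the (now-aligned) student features.
```

The payoff of this setup is that the multimodal knowledge is paid for once at training time, while deployment costs only the image encoder's forward pass.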