Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation

📅 2025-07-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Document Image Machine Translation (DIMT) suffers from insufficient training data and complex image-text modality coupling, leading to poor generalization. To address this, we propose M4Doc, a novel framework that aligns a unimodal image encoder with a multimodal large language model (MLLM) pre-trained on large-scale document image data. During training, the MLLM injects joint image-text knowledge, enabling the lightweight DIMT model to learn cross-modal correlations between visual and textual representations. At inference, M4Doc bypasses the MLLM entirely, preserving computational efficiency and practical deployability. Experiments demonstrate substantial improvements in translation quality across multiple benchmarks, with particularly strong generalization to out-of-domain scenarios and documents with complex layouts.

📝 Abstract
Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.
Problem

Research questions and friction points this paper is trying to address.

Limited training data restricts generalization in DIMT
Complex interplay between visual and textual information in document images
Invoking a full MLLM at inference is computationally expensive
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-to-mix modality alignment with MLLM
Lightweight model learns visual-textual correlations
Bypasses MLLM during inference for efficiency
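The single-to-mix alignment can be read as feature distillation: project the image-only encoder's features into the MLLM's representation space and pull them toward the frozen MLLM features during training, so that only the lightweight encoder is needed at inference. A minimal NumPy sketch, assuming an MSE alignment objective and made-up dimensions (the paper's actual projection, loss, and feature sizes are not specified here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper does not specify these.
D_IMG, D_MLLM, N = 8, 16, 4  # encoder dim, MLLM dim, number of tokens

def student_encode(pixels, W_enc):
    """Stand-in for the lightweight image-only encoder."""
    return pixels @ W_enc  # (N, D_IMG)

def project(h, W_proj):
    """Map image-only features into the MLLM representation space."""
    return h @ W_proj  # (N, D_MLLM)

def alignment_loss(h_student, h_teacher):
    """MSE between projected encoder features and frozen MLLM features."""
    return float(np.mean((h_student - h_teacher) ** 2))

# Training-time setup: the MLLM features act as a frozen teacher target.
pixels = rng.standard_normal((N, 32))
W_enc = rng.standard_normal((32, D_IMG)) * 0.1
W_proj = rng.standard_normal((D_IMG, D_MLLM)) * 0.1
h_mllm = rng.standard_normal((N, D_MLLM))  # frozen multimodal features

h_img = student_encode(pixels, W_enc)
loss = alignment_loss(project(h_img, W_proj), h_mllm)
print(loss >= 0.0)
```

At inference time only `student_encode` (and the downstream translation model) would run; the MLLM and the alignment loss are dropped, which is the source of the efficiency claim.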
Yupu Liang
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yaping Zhang
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zhiyang Zhang
Nanjing University
Yang Zhao
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing, China
Lu Xiang
Institute of Automation, Chinese Academy of Sciences
Chengqing Zong
State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing, China
Yu Zhou
Fanyu AI Laboratory, Zhongke Fanyu Technology Co., Ltd, Beijing, China