DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model

📅 2025-04-24
📈 Citations: 0 · Influential: 0
🤖 AI Summary
End-to-end machine translation of complex-layout document images remains challenging due to the need to jointly handle optical character recognition (OCR), layout understanding, and cross-lingual generation. Method: This paper proposes the first unified framework supporting both OCR-based and OCR-free translation scenarios. Built upon open-source large vision-language models (LVLMs), it introduces a training paradigm integrating multi-task learning with perceptual chain-of-thought reasoning to jointly optimize text recognition, layout comprehension, and target-language generation. It further incorporates minimum Bayesian decoding and customized post-processing to enhance structural fidelity. Contribution/Results: Evaluated on the ICDAR 2025 DIMT25 benchmark, the method achieves state-of-the-art performance, significantly improving translation accuracy and layout consistency. It is the first approach to realize high-fidelity, end-to-end translation from input document images to output target-language text.

📝 Abstract
This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging a state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayesian decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
Problem

Research questions and friction points this paper is trying to address.

Develops an end-to-end document image translation system
Addresses both OCR-based and OCR-free translation tasks
Enhances translation quality using multi-task learning and an LVLM
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages a large vision-language model (LVLM)
Multi-task learning with perceptual chain-of-thought
Minimum Bayesian decoding and post-processing
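The minimum Bayesian decoding step listed above is commonly realized as minimum Bayes risk (MBR) decoding: sample several candidate translations from the model, then select the one with the highest expected utility against the other candidates. The paper does not disclose its utility metric or implementation, so the sketch below is an illustrative assumption using a simple unigram-F1 overlap as the utility function.

```python
def token_f1(hyp: str, ref: str) -> float:
    """Unigram F1 overlap between two whitespace-tokenized strings
    (a stand-in for the paper's unspecified utility metric)."""
    h, r = set(hyp.split()), set(ref.split())
    common = len(h & r)
    if not h or not r or common == 0:
        return 0.0
    p, rec = common / len(h), common / len(r)
    return 2 * p * rec / (p + rec)

def mbr_decode(candidates: list[str]) -> str:
    """Return the candidate with the highest average utility against
    all other candidates, i.e. the maximum expected utility under the
    model's sampling distribution."""
    def expected_utility(c: str) -> float:
        others = [o for o in candidates if o is not c]
        return sum(token_f1(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

# Hypothetical candidates sampled from an LVLM for one source segment.
candidates = [
    "the contract takes effect on May 1",
    "the contract is effective from May 1",
    "contract effective May 1",
]
print(mbr_decode(candidates))
```

In practice the candidates would come from sampling the LVLM multiple times, and the utility would typically be a stronger sentence-level metric; the consensus-selection logic stays the same.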
👥 Authors
Zhanglin Wu
2012 Lab, Huawei Co. LTD
Machine Translation · Natural Language Processing
Tengfei Song
Huawei
Emotion Recognition · Computer Vision · Graph Neural Networks
Ning Xie
Huawei Translation Service Center, Nanjing, China
Weidong Zhang
Samsung Research America
Computer Vision · Image Processing
Pengfei Li
Huawei Translation Service Center, Nanjing, China
Shuang Wu
Huawei Translation Service Center, Nanjing, China
Chong Li
Huawei Translation Service Center, Nanjing, China
Junhao Zhu
Zhejiang University
Data Lake Management · Data Integration
Hao Yang
Huawei Translation Service Center, Nanjing, China