🤖 AI Summary
End-to-end machine translation of complex-layout document images remains challenging because it requires jointly handling optical character recognition (OCR), layout understanding, and cross-lingual generation. Method: This paper proposes a unified framework that supports both OCR-based and OCR-free translation scenarios. Built on open-source large vision-language models (LVLMs), it introduces a training paradigm that integrates multi-task learning with perceptual chain-of-thought reasoning to jointly optimize text recognition, layout comprehension, and target-language generation. At inference time it applies minimum Bayes risk (MBR) decoding and customized post-processing to enhance structural fidelity. Contribution/Results: The method was developed for and evaluated on the ICDAR 2025 DIMT25 benchmark, where it is reported to achieve state-of-the-art performance with improved translation accuracy and layout consistency. It delivers end-to-end translation from input document images to output target-language text.
📝 Abstract
This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging a state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayes risk (MBR) decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution addresses both OCR-based and OCR-free document image translation tasks within a single unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
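
The inference-time decoding step mentioned in the abstract can be illustrated with a minimal sketch of minimum Bayes risk selection over a set of sampled translations. This is not the authors' implementation: the abstract does not specify the utility metric or the number of candidates, so the function names `token_f1` and `mbr_select` and the token-overlap F1 utility are assumptions chosen only to keep the example self-contained. In practice a translation metric such as BLEU, chrF, or a learned metric would typically play the role of the utility.

```python
from collections import Counter


def token_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between two strings (an assumed stand-in utility)."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        # Two empty strings agree perfectly; otherwise there is no overlap.
        return float(h == r)
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(h)
    recall = overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: list[str], utility=token_f1) -> str:
    """Return the candidate with the highest total utility against the others.

    Each candidate is scored using all other candidates as pseudo-references,
    which approximates choosing the hypothesis with minimum expected risk.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        return sum(
            utility(candidates[i], candidates[j])
            for j in range(len(candidates))
            if j != i
        )

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical candidate translations sampled from a model for one document.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran away",
]
print(mbr_select(samples))
```

The selection favors the hypothesis that agrees most with the rest of the sample set, so an outlier translation is unlikely to be chosen even if the model assigned it a high probability.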