DIMT25@ICDAR2025: HW-TSC's End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model

📅 2025-04-24
📈 Citations: 0 · Influential: 0
🤖 AI Summary
End-to-end machine translation of complex-layout document images remains challenging due to the need to jointly handle optical character recognition (OCR), layout understanding, and cross-lingual generation. Method: This paper proposes the first unified framework supporting both OCR-based and OCR-free translation scenarios. Built upon open-source large vision-language models (LVLMs), it introduces a training paradigm integrating multi-task learning with perceptual chain-of-thought reasoning to jointly optimize text recognition, layout comprehension, and target-language generation. It further incorporates minimum Bayesian decoding and customized post-processing to enhance structural fidelity. Contribution/Results: Evaluated on the ICDAR 2025 DIMT25 benchmark, the method achieves state-of-the-art performance, significantly improving translation accuracy and layout consistency. It is the first approach to realize high-fidelity, end-to-end translation from input document images to output target-language text.

📝 Abstract
This paper presents the technical solution proposed by Huawei Translation Service Center (HW-TSC) for the "End-to-End Document Image Machine Translation for Complex Layouts" competition at the 19th International Conference on Document Analysis and Recognition (DIMT25@ICDAR2025). Leveraging a state-of-the-art open-source large vision-language model (LVLM), we introduce a training framework that combines multi-task learning with perceptual chain-of-thought to develop a comprehensive end-to-end document translation system. During the inference phase, we apply minimum Bayesian decoding and post-processing strategies to further enhance the system's translation capabilities. Our solution uniquely addresses both OCR-based and OCR-free document image translation tasks within a unified framework. This paper systematically details the training methods, inference strategies, LVLM base models, training data, experimental setups, and results, demonstrating an effective approach to document image machine translation.
Problem

Research questions and friction points this paper is trying to address.

Develops an end-to-end document image translation system
Addresses both OCR-based and OCR-free translation tasks
Enhances translation quality using multi-task learning and an LVLM
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages a large vision-language model (LVLM)
Multi-task learning with perceptual chain-of-thought
Minimum Bayesian decoding and post-processing
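The minimum Bayesian decoding step listed above is commonly realized as minimum Bayes risk (MBR) decoding: sample several candidate translations from the model, then select the one with the highest expected utility against the other candidates. The paper does not disclose its utility metric or implementation, so the sketch below is an illustrative assumption using a simple unigram-F1 overlap as the utility function.

```python
def token_f1(hyp: str, ref: str) -> float:
    """Unigram F1 overlap between two whitespace-tokenized strings
    (a stand-in for the paper's unspecified utility metric)."""
    h, r = set(hyp.split()), set(ref.split())
    common = len(h & r)
    if not h or not r or common == 0:
        return 0.0
    p, rec = common / len(h), common / len(r)
    return 2 * p * rec / (p + rec)

def mbr_decode(candidates: list[str]) -> str:
    """Return the candidate with the highest average utility against
    all other candidates, i.e. the maximum expected utility under the
    model's sampling distribution."""
    def expected_utility(c: str) -> float:
        others = [o for o in candidates if o is not c]
        return sum(token_f1(c, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

# Hypothetical candidates sampled from an LVLM for one source segment.
candidates = [
    "the contract takes effect on May 1",
    "the contract is effective from May 1",
    "contract effective May 1",
]
print(mbr_decode(candidates))
```

In practice the candidates would come from sampling the LVLM multiple times, and the utility would typically be a stronger sentence-level metric; the consensus-selection logic stays the same.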
👥 Authors
Zhanglin Wu
2012 Lab, Huawei Co. LTD
Machine Translation · Natural Language Processing
Tengfei Song
Huawei
Emotion Recognition · Computer Vision · Graph Neural Networks
Ning Xie
Huawei Translation Service Center, Nanjing, China
Weidong Zhang
Samsung Research America
Computer Vision · Image Processing
Pengfei Li
Huawei Translation Service Center, Nanjing, China
Shuang Wu
Huawei Translation Service Center, Nanjing, China
Chong Li
Huawei Translation Service Center, Nanjing, China
Junhao Zhu
Zhejiang University
Data Lake Management · Data Integration
Hao Yang
Huawei Translation Service Center, Nanjing, China