🤖 AI Summary
Existing PDF translation approaches struggle to preserve semantic accuracy and layout fidelity at the same time: text-oriented translation tools discard structural metadata, while document parsers cannot re-render a translated document faithfully. This work introduces BabelDOC, an intermediate representation (IR)-based framework for PDF translation that decouples visual layout from semantic content, enabling document-level operations such as terminology extraction, cross-page context modeling, glossary-constrained generation, and formula placeholder handling. An adaptive typesetting engine then re-anchors the translated text to the original layout. On a curated 200-page benchmark, with human and multimodal LLM-as-a-judge evaluation, BabelDOC improves layout fidelity, visual aesthetics, and terminology consistency over representative baselines while keeping translation precision competitive. The open-source implementation has attracted over 8.4K GitHub stars and 17 contributors.
📝 Abstract
As global cross-lingual communication intensifies, language barriers in visually rich documents such as PDFs remain a practical bottleneck. Existing document translation pipelines face a tension between linguistic processing and layout preservation: text-oriented Computer-Assisted Translation (CAT) systems often discard structural metadata, while document parsers focus on extraction and do not support faithful re-rendering after translation. We introduce BabelDOC, an Intermediate Representation (IR)-based framework for layout-preserving PDF translation. BabelDOC decouples visual layout metadata from semantic content, enabling document-level translation operations such as terminology extraction, cross-page context handling, glossary-constrained generation, and formula placeholder substitution. The translated content is then re-anchored to the original layout through an adaptive typesetting engine. Experiments on a curated 200-page benchmark, together with human evaluation and multimodal LLM-as-a-judge evaluation, show that BabelDOC improves layout fidelity, visual aesthetics, and terminology consistency over representative baselines, while maintaining competitive translation precision. The open-source toolkit and its interactive downstream applications are publicly available and have attracted over 8.4K GitHub stars and 17 contributors at the time of writing. A demonstration video is also available.
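To make the idea of separating layout from content more concrete, the following is a minimal Python sketch of how formula placeholders might work on a single IR block. It is not BabelDOC's implementation: the `Block` class, the `{v0}`-style tokens, the `$...$` formula pattern, and the `translate` callback are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Assumption: formulas appear as inline LaTeX-style $...$ spans.
FORMULA_RE = re.compile(r"\$[^$]+\$")


@dataclass
class Block:
    """Hypothetical IR node: layout metadata is stored apart from the text."""
    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1 on the page
    text: str


def mask_formulas(text: str) -> tuple[str, dict[str, str]]:
    """Replace each formula with an opaque token and record the mapping."""
    mapping: dict[str, str] = {}

    def _sub(match: re.Match) -> str:
        token = f"{{v{len(mapping)}}}"  # yields {v0}, {v1}, ...
        mapping[token] = match.group(0)
        return token

    return FORMULA_RE.sub(_sub, text), mapping


def unmask_formulas(text: str, mapping: dict[str, str]) -> str:
    """Put the original formulas back in place of their tokens."""
    for token, formula in mapping.items():
        text = text.replace(token, formula)
    return text


def translate_block(block: Block, translate: Callable[[str], str]) -> Block:
    """Translate only the text; the bounding box is carried over unchanged."""
    masked, mapping = mask_formulas(block.text)
    translated = translate(masked)
    return Block(bbox=block.bbox, text=unmask_formulas(translated, mapping))
```

In this sketch the translator only ever sees the masked string, so it cannot alter a formula, and the untouched `bbox` is what a typesetting step would later use to place the translated text. A real system would also need to handle a translator that drops or reorders tokens, which this example does not attempt.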