DocHPLT: A Massively Multilingual Document-Level Translation Dataset

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing document-level machine translation (MT) datasets suffer from narrow language coverage, limited scale, and reliance on sentence-level alignment for document reconstruction, which hinders realistic modeling of document structure and multilingual long-context research. To address this, we introduce DocHPLT, the largest publicly available document-level bilingual dataset to date, comprising 124 million document pairs across 50 languages paired with English and totaling 4.26 billion sentences. We adapt web crawling and document alignment pipelines to preserve original document structure, including unaligned segments, enabling large-scale acquisition of authentic document-level bilingual data. Fine-tuning large language models on DocHPLT yields substantial improvements over instruction-tuned baselines for multilingual document translation, especially for low-resource languages. DocHPLT thus establishes critical infrastructure for advancing long-context modeling and multilingual document-level MT research.

📝 Abstract
Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences, with the further possibility of deriving 2,500 bonus pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content including unaligned portions. After our preliminary experiments identify the optimal training context strategy for document-level translation, we demonstrate that LLMs fine-tuned on DocHPLT substantially outperform off-the-shelf instruction-tuned baselines, with particularly dramatic improvements for under-resourced languages. We open-source the dataset under a permissive license, providing essential infrastructure for advancing multilingual document-level translation.
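
As a rough illustration of how a corpus at this scale might be consumed, the sketch below streams a few document pairs rather than downloading them all. The repository ID, configuration name, and field names are hypothetical placeholders, not the official schema; consult the actual release for the real identifiers.

```python
# Minimal sketch of streaming document pairs with the Hugging Face
# `datasets` library. The repo ID ("HPLT/DocHPLT"), config ("eng-fin"),
# and field names are assumptions for illustration only.
from datasets import load_dataset

pairs = load_dataset("HPLT/DocHPLT", "eng-fin", split="train", streaming=True)

for pair in pairs.take(3):  # inspect a handful instead of 124M pairs
    src_doc = pair["src_text"]  # full source document, structure intact
    tgt_doc = pair["tgt_text"]  # full target document, incl. unaligned parts
    print(f"{len(src_doc):>8} src chars | {len(tgt_doc):>8} tgt chars")
```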
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of document-level translation resources beyond high-resource languages
Creating large-scale aligned document pairs while preserving complete source-document integrity
Improving document-level translation performance, especially for under-resourced languages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modifying a web extraction pipeline to preserve complete document integrity
Identifying the optimal training context strategy for document-level translation (a generic variant is sketched below)
Fine-tuning LLMs on the multilingual document-level dataset
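
As a hedged illustration of what a "training context strategy" can look like in practice, the sketch below greedily packs consecutive aligned sentence pairs into fixed token budgets so that fine-tuning examples carry document context rather than isolated sentences. The tokenizer choice and the 4096-token budget are assumptions; the strategy the paper actually selects may differ.

```python
# Illustrative packing strategy for document-level MT fine-tuning:
# group consecutive aligned sentence pairs into chunks under a fixed
# token budget. NOT necessarily the paper's chosen strategy; the
# tokenizer and budget below are assumptions for illustration.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for budgeting
MAX_TOKENS = 4096  # hypothetical context budget

def pack_document(src_sents, tgt_sents, budget=MAX_TOKENS):
    """Split an aligned document into (src_chunk, tgt_chunk) pairs under `budget` tokens."""
    chunks, cur_src, cur_tgt, used = [], [], [], 0
    for s, t in zip(src_sents, tgt_sents):
        cost = len(tok(s)["input_ids"]) + len(tok(t)["input_ids"])
        if cur_src and used + cost > budget:
            # Current chunk is full: emit it and start a new one.
            chunks.append((" ".join(cur_src), " ".join(cur_tgt)))
            cur_src, cur_tgt, used = [], [], 0
        cur_src.append(s)
        cur_tgt.append(t)
        used += cost
    if cur_src:
        chunks.append((" ".join(cur_src), " ".join(cur_tgt)))
    return chunks
```

Greedy packing is just one point in the design space; alternatives include training on whole documents up to the model's context length or on sliding windows with overlap.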