OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

📅 2025-08-21
🤖 AI Summary
The absence of document-level evaluation benchmarks hinders progress in machine translation (MT) for low-resource languages in the health domain. Method: We introduce OpenWHO, the first parallel corpus designed for document-level health MT evaluation in low-resource languages, comprising 2,978 official WHO documents and 26,824 sentences across more than 20 languages (nine of them low-resource), all expert-authored and professionally translated. We conduct the first systematic comparison of large language models (e.g., Gemini 2.5 Flash) and conventional MT systems (e.g., NLLB-54B) on document-level health text translation, explicitly evaluating their capacity to exploit long-range context. Results: Gemini 2.5 Flash achieves a +4.79 ChrF improvement over NLLB-54B on the low-resource test set, indicating that effective contextual modelling pays off in specialised domains. OpenWHO fills a critical gap in low-resource health MT evaluation and provides infrastructure for research on domain adaptation and document-level context integration.

📝 Abstract
In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization's e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.
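The headline result is reported in ChrF (character n-gram F-score), a metric that tends to be more robust than BLEU for morphologically rich low-resource languages. As a reference point, here is a minimal sentence-level sketch of the metric (Popović, 2015); the paper presumably uses a standard implementation such as sacreBLEU's, which differs in corpus-level aggregation details.

```python
from collections import Counter

def char_ngrams(text, n):
    # Character n-grams with spaces removed, as in standard chrF.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified sentence-level chrF: character n-gram F-scores
    (recall weighted by beta) macro-averaged over n-gram orders."""
    f_scores = []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        hyp_total, ref_total = sum(hyp.values()), sum(ref.values())
        if hyp_total == 0 or ref_total == 0 or overlap == 0:
            f_scores.append(0.0)
            continue
        prec, rec = overlap / hyp_total, overlap / ref_total
        f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / len(f_scores)
```

On this scale, the reported +4.79 gap between Gemini 2.5 Flash and NLLB-54B is a difference in corpus-level ChrF points out of 100.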
Problem

Research questions and friction points this paper is trying to address.

Lack of MT evaluation datasets for low-resource health translation
Scarcity of document-level parallel corpora in healthcare
Unclear how LLMs compare to traditional MT models in the health domain
Innovation

Methods, ideas, or system contributions that make the work stand out.

Document-level parallel corpus (OpenWHO) for health translation
Systematic evaluation of LLMs versus traditional MT models
Investigating how LLM context utilisation affects translation accuracy
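The context-utilisation comparison can be illustrated with a small sketch of how sentence-level and document-level prompts might be constructed for an LLM translator. The prompt wording and the `window` parameter are illustrative assumptions, not the paper's exact setup.

```python
def build_prompt(sentences, index, target_lang, window=0):
    """Build a translation prompt for sentences[index].

    window=0 reproduces sentence-level MT (no context); a larger
    window prepends preceding document sentences as context, the
    document-level setting the paper finds most helpful in
    specialised domains like health.
    """
    start = max(0, index - window)
    context = sentences[start:index]
    prompt = f"Translate the following sentence into {target_lang}.\n"
    if context:
        prompt += "Document context (do not translate):\n"
        prompt += "\n".join(f"- {s}" for s in context) + "\n"
    prompt += f"Sentence: {sentences[index]}"
    return prompt

doc = [
    "Wash your hands with soap and water.",
    "This reduces transmission of pathogens.",
    "It is the most effective prevention measure.",
]
sentence_level = build_prompt(doc, 2, "Swahili", window=0)
document_level = build_prompt(doc, 2, "Swahili", window=2)
```

In the document-level variant, earlier sentences disambiguate pronouns and domain terms ("It" refers to handwashing), which is precisely the kind of long-range signal the paper evaluates.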