🤖 AI Summary
African languages lack high-quality document-level parallel data and systematic evaluation benchmarks. Method: We introduce AFRIDOC-MT, the first document-level multi-parallel corpus for African languages, comprising 605 human-translated documents (334 health and 271 IT news) that pair English with Amharic, Hausa, Swahili, Yoruba, and Zulu. We establish the first document-level machine translation benchmark for African languages, translating at both the sentence and pseudo-document levels and realigning the outputs into complete documents for evaluation. Contribution/Results: NLLB-200 achieves the best average performance among standard NMT systems, and GPT-4o outperforms general-purpose LLMs. Fine-tuning yields substantial gains, yet models trained on sentences generalize poorly to longer documents. Some LLMs also exhibit under-generation, repetition of words or phrases, and off-target translations, especially for African languages.
📝 Abstract
This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.