TULUN: Transparent and Adaptable Low-resource Machine Translation

๐Ÿ“… 2025-05-24
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Low-resource languages often perform poorly in domain-specific translation, and existing domain adaptation methods rely on model fine-tuning, putting them out of reach for non-technical users and small organizations. This paper proposes Tulun, a transparent, human-in-the-loop adaptation framework that requires no fine-tuning: it combines neural machine translation (NMT) with terminology-aware post-editing by a large language model (LLM), guided by existing glossaries and translation memories. An open-source web platform lets users collaboratively build, edit, and reuse terminology databases and translation memories. Because no retraining is needed, the approach improves domain controllability and usability. On medical and disaster-relief translation tasks for Tetun and Bislama, it achieves ChrF++ gains of 16.90-22.41 points over baseline MT systems; across six low-resource languages on the FLORES benchmark, it outperforms NLLB-54B by an average of 2.8 ChrF points.

๐Ÿ“ Abstract
Machine translation (MT) systems that support low-resource languages often struggle on specialized domains. While researchers have proposed various techniques for domain adaptation, these approaches typically require model fine-tuning, making them impractical for non-technical users and small organizations. To address this gap, we propose Tulun, a versatile solution for terminology-aware translation, combining neural MT with large language model (LLM)-based post-editing guided by existing glossaries and translation memories. Our open-source web-based platform enables users to easily create, edit, and leverage terminology resources, fostering a collaborative human-machine translation process that respects and incorporates domain expertise while increasing MT accuracy. Evaluations show effectiveness in both real-world and benchmark scenarios: on medical and disaster relief translation tasks for Tetun and Bislama, our system achieves improvements of 16.90-22.41 ChrF++ points over baseline MT systems. Across six low-resource languages on the FLORES dataset, Tulun outperforms both standalone MT and LLM approaches, achieving an average improvement of 2.8 ChrF points over NLLB-54B.
Problem

Research questions and friction points this paper is trying to address.

Enhancing low-resource machine translation for specialized domains
Eliminating fine-tuning needs for non-technical users and small organizations
Improving translation accuracy using terminology-aware human-machine collaboration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines neural MT with LLM-based post-editing
Uses glossaries and translation memories for guidance
Open-source web platform for collaborative terminology management
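The pipeline the bullets describe (an NMT draft, glossary lookup, then an LLM post-editing prompt constrained by matched terms) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, prompt wording, and the Tetun glossary entry are hypothetical.

```python
# Hypothetical sketch of terminology-aware LLM post-editing in the style
# Tulun describes: match glossary terms against the source sentence, then
# build a prompt asking an LLM to revise a draft NMT translation.

def find_glossary_matches(source: str, glossary: dict) -> dict:
    """Return glossary entries whose source-language term appears in the sentence."""
    return {term: target for term, target in glossary.items()
            if term.lower() in source.lower()}

def build_postedit_prompt(source: str, draft: str, glossary: dict,
                          memory: list = ()) -> str:
    """Assemble an LLM prompt for a terminology-constrained post-edit."""
    matched = find_glossary_matches(source, glossary)
    lines = ["Post-edit the draft translation, keeping the required terminology."]
    lines.append(f"Source: {source}")
    lines.append(f"Draft translation: {draft}")
    if matched:
        lines.append("Required terminology:")
        lines += [f"- {src} -> {tgt}" for src, tgt in matched.items()]
    # Translation-memory entries serve as in-context examples.
    for tm_src, tm_tgt in memory:
        lines.append(f"Example: {tm_src} => {tm_tgt}")
    return "\n".join(lines)

# Hypothetical English->Tetun glossary entry and NMT draft, for illustration.
glossary = {"dehydration": "dezidratasaun"}
prompt = build_postedit_prompt(
    "Signs of dehydration in children",
    "Sinais de dehydration iha labarik",
    glossary,
)
print(prompt)
```

The prompt is then sent to an LLM, whose output replaces the NMT draft; the glossary and translation memory stay editable by users, which is what removes the need for any model retraining.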
๐Ÿ”Ž Similar Papers
No similar papers found.