Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service

📅 2024-11-19

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Low-resource machine translation (MT) serves community needs that remain poorly characterized, and available corpora may not reflect what users actually translate. This paper presents an observational study of real-world usage, analyzing 100,000 translation requests from tetun.org, a dedicated MT service for Tetun, the lingua franca of Timor-Leste. Many users are students on mobile devices, and they typically translate from a high-resource language into Tetun across diverse domains including science, healthcare, and daily life. Available Tetun corpora, by contrast, are dominated by news articles covering government and social issues, so they match actual usage poorly. The authors conclude that MT systems for institutionalized minority languages like Tetun should prioritize accuracy on domains relevant to education, in the high-resource to low-resource direction, and that observational analysis can ground low-resource language technology in practical community needs.

📝 Abstract

Low-resource machine translation (MT) presents a diversity of community needs and application challenges that remain poorly understood. To complement surveys and focus groups, which tend to rely on small samples of respondents, we propose an observational study on actual usage patterns of tetun.org, a specialized MT service for the Tetun language, which is the lingua franca in Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate text from a high-resource language into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for institutionalized minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts, in the high-resource to low-resource direction. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development, by grounding research in practical community needs.

Problem

Research questions and friction points this paper is trying to address.

Understanding community needs in low-resource machine translation

Analyzing usage patterns of Tetun language translation service

Prioritizing accuracy in educational domains for minority languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Observational study on Tetun MT usage

Analyzed 100,000 translation requests

Prioritize high-to-low resource accuracy
