🤖 AI Summary
Low-resource machine translation (MT) suffers from poorly characterized user needs and domain-biased corpora. The gap matters most in high-stakes settings such as education, where translation from a high-resource into a low-resource language (e.g., English → Tetun) remains unreliable. To address this, we conduct a large-scale observational analysis of real-world MT usage, drawing on 100,000 anonymized translation logs from tetun.org, a platform used largely by students on mobile devices in Timor-Leste. The analysis shows that users mostly translate from a high-resource language into Tetun, across domains including science, healthcare, and daily life. Existing Tetun parallel corpora, by contrast, are dominated by news articles and match actual usage poorly. These findings suggest that MT evaluation and optimization for Tetun should prioritize accuracy on education-relevant domains in the high-resource → low-resource direction. The work provides an empirical basis for designing MT systems for institutionalized minority languages such as Tetun.
📝 Abstract
Low-resource machine translation (MT) presents a diversity of community needs and application challenges that remain poorly understood. To complement surveys and focus groups, which tend to rely on small samples of respondents, we propose an observational study on actual usage patterns of tetun.org, a specialized MT service for the Tetun language, which is the lingua franca in Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate text from a high-resource language into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for institutionalized minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts, in the high-resource to low-resource direction. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development by grounding research in practical community needs.