AI Summary
To address the scarcity of training corpora for low-resource languages (e.g., Uyghur, Tibetan) in zero-shot cross-lingual transfer of large language models (LLMs), this paper introduces and open-sources CUTE, a multilingual dataset covering Chinese, English, Uyghur, and Tibetan: 25 GB of parallel and 25 GB of non-parallel text, constituting the largest publicly available Uyghur/Tibetan corpus to date. The data is generated via high-quality machine translation and validated through human evaluation for reliability. The paper systematically investigates how corpus parallelism affects cross-lingual transfer performance. Experiments demonstrate that fine-tuning LLMs on CUTE significantly improves zero-shot understanding and generation in Uyghur and Tibetan. These results empirically validate that large-scale synthetic corpora can effectively bridge training data gaps for low-resource languages, establishing CUTE as infrastructure for cross-lingual LLM research and providing evidence for data-centric approaches to multilingual model development.
Abstract
Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities on various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is largely limited to resource-rich languages; for the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source CUTE (Chinese, Uyghur, Tibetan, English), a dataset consisting of two 25 GB four-language corpora (one parallel and one non-parallel) obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validated that machine translation quality for Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for the Uyghur and Tibetan languages to date. We demonstrate its effectiveness in enhancing LLMs' ability to process low-resource languages and investigate the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.