CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages

πŸ“… 2025-09-21
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the scarcity of training corpora for low-resource languages (e.g., Uyghur, Tibetan) in zero-shot cross-lingual transfer of large language models (LLMs), this paper introduces and open-sources CUTE, a multilingual dataset covering Chinese, English, Uyghur, and Tibetan—25 GB of parallel and 25 GB of non-parallel text, constituting the largest publicly available Uyghur/Tibetan corpus to date. The data is generated via machine translation whose quality is validated through human evaluation prior to corpus construction. The authors systematically investigate how corpus parallelism affects cross-lingual transfer performance. Experiments demonstrate that fine-tuning LLMs on CUTE significantly improves zero-shot understanding and generation in Uyghur and Tibetan. The results empirically validate that large-scale synthetic corpora can effectively bridge training data gaps for low-resource languages, establishing CUTE as useful infrastructure for cross-lingual LLM research and providing evidence for data-centric approaches to multilingual model development.

πŸ“ Abstract
Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source the CUTE (Chinese, Uyghur, Tibetan, English) dataset, consisting of two 25 GB four-language corpora (one parallel and one non-parallel) obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validates that the machine translation quality between Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for the Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs' ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.
Problem

Research questions and friction points this paper is trying to address.

Addressing inadequate LLM support for low-resource languages
Mitigating training data scarcity for Uyghur and Tibetan languages
Enhancing cross-lingual knowledge transfer through parallel corpora
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed a multilingual parallel corpus through machine translation
Validated via human assessment that Uyghur/Tibetan translation quality approaches that of Chinese-English translation
Enhanced LLM capabilities for low-resource language processing
πŸ”Ž Similar Papers
No similar papers found.
Wenhao Zhuang
Kuaishou Technology
Natural Language Processing
Yuan Sun
Minzu University of China, Beijing, China