🤖 AI Summary
Supervised text generation tasks such as headline generation are severely hindered for Chinese minority languages (e.g., Tibetan, Uyghur, and Traditional Mongolian): their non-Latin writing systems differ from international standards, and high-quality labeled corpora are acutely scarce.
Method: We introduce CMHG (Chinese Minority Headline Generation), a dataset curated specifically for headline generation that comprises 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Traditional Mongolian. A separate high-quality test set was annotated by native speakers.
Contribution/Results: CMHG supplies supervised training data for headline generation in these low-resource languages, and its native-speaker-annotated test set is intended to serve as a benchmark for future research. Together these address two gaps: the lack of supervised training data and the absence of a shared evaluation benchmark.
📝 Abstract
Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges because their unique writing systems differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan and 50,000 entries each for Uyghur and Traditional Mongolian, curated specifically for headline generation. Additionally, we provide a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and will contribute to the development of related benchmarks.