CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

📅 2025-09-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Supervised text generation tasks such as headline generation are severely hindered for Chinese minority languages (e.g., Tibetan, Uyghur, Mongolian) by their non-Latin scripts and an acute scarcity of high-quality parallel corpora. Method: We introduce CMHG (Chinese Minority Headline Generation), a multilingual headline generation dataset comprising 100K Tibetan and 50K each of Uyghur and Mongolian news–headline pairs, curated specifically for headline generation. A separate high-quality test set was annotated by native speakers. Contribution/Results: CMHG provides a benchmark for headline generation in these low-resource, non-Latin-script languages. It addresses two gaps, the lack of supervised training data and the absence of a shared evaluation set, and is intended to support and standardize future research in this under-resourced domain.

📝 Abstract

Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.

Problem

Research questions and friction points this paper is trying to address.

Addressing headline generation for Chinese minority languages

Creating a benchmark dataset for Tibetan, Uyghur, and Mongolian

Addressing the lack of supervised corpora for languages with distinct writing systems

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset for minority language headline generation

Benchmark test set annotated by native speakers

Curated entries for Tibetan, Uyghur, and Mongolian
