CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China

📅 2025-09-12
🤖 AI Summary
Problem: Supervised text generation tasks such as headline generation are severely hindered for minority languages in China (e.g., Tibetan, Uyghur, Traditional Mongolian) because their writing systems differ from international standards and relevant corpora are scarce. Method: We introduce CMHG (Chinese Minority Headline Generation), a dataset comprising 100,000 Tibetan entries and 50,000 entries each for Uyghur and Mongolian, curated specifically for headline generation. It is accompanied by a high-quality test set annotated by native speakers. Contribution/Results: CMHG supplies supervised training data and a benchmark test set for headline generation in these low-resource languages, addressing two gaps at once: the lack of supervised corpora and the absence of a shared evaluation benchmark.

📝 Abstract
Minority languages in China, such as Tibetan, Uyghur, and Traditional Mongolian, face significant challenges due to their unique writing systems, which differ from international standards. This discrepancy has led to a severe lack of relevant corpora, particularly for supervised tasks like headline generation. To address this gap, we introduce a novel dataset, Chinese Minority Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and 50,000 entries each for Uyghur and Mongolian, specifically curated for headline generation tasks. Additionally, we propose a high-quality test set annotated by native speakers, designed to serve as a benchmark for future research in this domain. We hope this dataset will become a valuable resource for advancing headline generation in Chinese minority languages and contribute to the development of related benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Addressing headline generation for minority languages in China
Creating a benchmark dataset for Tibetan, Uyghur, and Mongolian
Addressing the lack of supervised corpora for languages whose writing systems differ from international standards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset for headline generation in minority languages
Benchmark test set annotated by native speakers
Curated entries for Tibetan, Uyghur, and Mongolian
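
As a rough illustration of the dataset composition stated in the abstract, the following Python sketch tallies the per-language entry counts and their shares. The entry counts come from the abstract; the `HeadlineExample` record and its field names are hypothetical and are not taken from the released data format.

```python
from dataclasses import dataclass

# Entry counts per language, as stated in the abstract.
CMHG_SIZES = {"Tibetan": 100_000, "Uyghur": 50_000, "Mongolian": 50_000}


@dataclass(frozen=True)
class HeadlineExample:
    """Hypothetical record layout; field names are assumptions, not the released schema."""
    language: str
    article: str
    headline: str


def language_shares(sizes):
    """Return each language's fraction of the total number of entries."""
    total = sum(sizes.values())
    return {lang: count / total for lang, count in sizes.items()}


total_entries = sum(CMHG_SIZES.values())
shares = language_shares(CMHG_SIZES)

for lang, count in CMHG_SIZES.items():
    print(f"{lang}: {count} entries ({shares[lang]:.0%} of {total_entries})")
```

Under these counts, Tibetan makes up half of the corpus, so any score averaged over all entries would be weighted toward Tibetan unless results are reported per language.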
Authors

Guixian Xu (Nokia; Wireless communication and algorithm)
Zeli Su (Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China)
Ziyin Zhang (Shanghai Jiao Tong University; Artificial Intelligence, Natural Language Processing, Large Language Models)
Jianing Liu (Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China)
XU Han (Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China)
Ting Zhang (Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China)
Yushuang Dong (Key Laboratory of Ethnic Language Intelligent Analysis and Security Governance of MOE, Minzu University of China)