CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs

๐Ÿ“… 2025-10-07
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Chinese large language models (LLMs) suffer from inadequate structured semantic representation, English-centric evaluation benchmarks, and a scarcity of high-quality, linguistically grounded structured assessment resources. To address these challenges, we introduce CDTPโ€”the first large-scale Chinese text-triple alignment dataset (7M pairs across four domains)โ€”and CB-ECLLM, the first comprehensive benchmark tailored for Chinese LLMs, supporting fine-grained multi-task evaluation including knowledge graph completion, triple generation, and question answering. Methodologically, we propose an integrated framework combining text-triple alignment construction, multi-task supervised fine-tuning, and ablation-aware analysis. Extensive experiments demonstrate substantial improvements in Chinese LLMsโ€™ structured understanding and generation capabilities, alongside strong cross-scenario robustness. All code and data are publicly released.

๐Ÿ“ Abstract
Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks. However, Chinese LLMs face unique challenges, primarily due to the dominance of unstructured free text and the lack of structured representations in Chinese corpora. While existing benchmarks partially assess Chinese LLMs, they remain predominantly English-centric, fail to address the unique linguistic characteristics of Chinese, and lack the structured datasets essential for robust evaluation. To address these challenges, we present a Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) based on the newly constructed Chinese Data-Text Pair (CDTP) dataset. Specifically, CDTP comprises over 7 million aligned text pairs, each consisting of unstructured text coupled with one or more corresponding triples, alongside a total of 15 million triples spanning four critical domains. The core contributions of CDTP are threefold: (i) enriching Chinese corpora with high-quality structured information; (ii) enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii) supporting multi-task fine-tuning to assess generalization and robustness across scenarios, including Knowledge Graph Completion, Triple-to-Text generation, and Question Answering. Furthermore, we conduct rigorous evaluations through extensive experiments and ablation studies to assess the benchmark's effectiveness, the impact of Supervised Fine-Tuning (SFT), and robustness. To support reproducible research, we offer an open-source codebase and outline potential directions for future investigations based on our insights.
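To make the text-triple alignment concrete, here is a minimal sketch of what one aligned record might look like: unstructured text paired with one or more (head, relation, tail) triples plus a domain tag. The field names and example content are illustrative assumptions, not CDTP's actual schema.

```python
# Hypothetical record shape for a text-triple alignment dataset like CDTP.
# Field names (text, triples, domain) are assumptions for illustration only.
from dataclasses import dataclass, field


@dataclass
class Triple:
    head: str      # subject entity
    relation: str  # predicate linking head to tail
    tail: str      # object entity


@dataclass
class CDTPRecord:
    text: str                                     # unstructured source text
    triples: list = field(default_factory=list)   # aligned structured facts
    domain: str = ""                              # one of the dataset's domains

record = CDTPRecord(
    text="Beijing is the capital of China.",
    triples=[Triple("Beijing", "capital_of", "China")],
    domain="geography",
)
print(record.triples[0])  # Triple(head='Beijing', relation='capital_of', tail='China')
```

One sentence of text can align with several triples, which is why `triples` is a list rather than a single field.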
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of structured Chinese datasets for LLM evaluation
Solves English-centric bias in existing benchmarks for Chinese models
Enables comprehensive assessment of Chinese linguistic characteristics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructs large-scale Chinese data-text pair dataset
Enables multi-task fine-tuning for generalization assessment
Provides structured triples for knowledge-driven task evaluation
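The same aligned record can feed all three evaluation tasks the benchmark supports. The sketch below shows one plausible way to derive task-specific prompts from a triple; the templates are assumptions for illustration, not the paper's actual prompt formats.

```python
# Hypothetical prompt construction for the three evaluation tasks.
# Templates are illustrative, not taken from the paper.

def kgc_prompt(head: str, relation: str) -> str:
    # Knowledge Graph Completion: ask the model to predict the missing tail.
    return f"Complete the triple: ({head}, {relation}, ?)"

def triple_to_text_prompt(triples: list) -> str:
    # Triple-to-Text: ask the model to verbalize structured facts as prose.
    facts = "; ".join(f"({h}, {r}, {t})" for h, r, t in triples)
    return f"Describe the following facts in natural language: {facts}"

def qa_prompt(head: str, relation: str) -> str:
    # Question Answering: pose a question grounded in the triple.
    return f"What is the {relation} of {head}?"

print(kgc_prompt("Beijing", "capital_of"))
print(triple_to_text_prompt([("Beijing", "capital_of", "China")]))
print(qa_prompt("Beijing", "capital_of"))
```

Because each task draws on the same underlying triples, performance differences across tasks isolate generation versus completion versus retrieval abilities rather than differences in source data.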
๐Ÿ”Ž Similar Papers
No similar papers found.