Fuxi: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face significant challenges in classical Chinese text understanding and generation due to linguistic specificity, rigid structural constraints (e.g., prosody, parallelism), and deep cultural embedding; existing benchmarks emphasize comprehension but lack rigorous generative evaluation. Method: We introduce the first comprehensive classical Chinese benchmark jointly assessing understanding and generation across 21 tasks—including novel generative tasks such as classical poetry composition and couplet completion—and propose a “comprehension-generation balanced” evaluation framework. Our hybrid evaluation metric integrates rule-based validation (e.g., tonal patterns, allusion accuracy, syntactic correctness) with fine-tuned LLM-based judgment. We further pioneer the incorporation of a “cultural authenticity” dimension. Results: Experiments reveal that state-of-the-art models achieve strong comprehension performance but exhibit substantial generative deficits, particularly on high-cultural-density and strict-form tasks, where accuracy drops below 40%.
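The hybrid metric described above combines rule-based validation (e.g., tonal patterns) with an LLM-judge score. A minimal sketch of that combination is below; the tone table, the equal weighting, and the judge score are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a hybrid evaluation score: a rule-based tonal
# check combined with an LLM-judge score via a weighted average.
# The toy tone table maps characters to level (ping, True) or
# oblique (ze, False) tone classes; a real system would use a
# comprehensive rhyme dictionary.
TONE_TABLE = {"春": True, "风": True, "得": False, "意": False}

def tonal_pattern_score(line: str, expected: list) -> float:
    """Fraction of characters whose tone class matches the expected pattern."""
    if not line or len(line) != len(expected):
        return 0.0
    matches = sum(
        1 for ch, want in zip(line, expected)
        if TONE_TABLE.get(ch) == want
    )
    return matches / len(expected)

def hybrid_score(rule_score: float, judge_score: float, w_rule: float = 0.5) -> float:
    """Weighted combination of rule-based and LLM-judge scores, both in [0, 1]."""
    return w_rule * rule_score + (1 - w_rule) * judge_score

# Example: a four-character line checked against a ping-ping-ze-ze pattern,
# combined with an assumed LLM-judge score of 0.8.
rule = tonal_pattern_score("春风得意", [True, True, False, False])
print(hybrid_score(rule, judge_score=0.8))  # 0.5 * 1.0 + 0.5 * 0.8 = 0.9
```

The weighting between rule-based and model-based components is a free parameter here; the paper's framework additionally folds in allusion accuracy, syntactic checks, and the cultural-authenticity dimension.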

📝 Abstract
Ancient Chinese text processing presents unique challenges for large language models (LLMs) due to its distinct linguistic features, complex structural constraints, and rich cultural context. While existing benchmarks have primarily focused on evaluating comprehension through multiple-choice questions, there remains a critical gap in assessing models' generative capabilities in classical Chinese. We introduce Fuxi, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. Our benchmark distinguishes itself through three key contributions: (1) balanced coverage of both comprehension and generation tasks, including novel tasks like poetry composition and couplet completion, (2) specialized evaluation metrics designed specifically for classical Chinese text generation, combining rule-based verification with fine-tuned LLM evaluators, and (3) a systematic assessment framework that considers both linguistic accuracy and cultural authenticity. Through extensive evaluation of state-of-the-art LLMs, we reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks, particularly those requiring deep cultural knowledge and adherence to classical formats. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development. The benchmark, evaluation toolkit, and baseline results are publicly available to facilitate research in this domain.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on ancient Chinese text understanding and generation.
Assessing generative capabilities in classical Chinese with novel tasks.
Developing specialized metrics for classical Chinese text generation evaluation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Balanced comprehension and generation tasks
Specialized metrics for classical Chinese
Systematic linguistic and cultural assessment
Shangqing Zhao
University of Oklahoma
wireless networking, network security, wireless communication
Yuhao Zhou
School of Computer Science and Technology, East China Normal University
Yupei Ren
East China Normal University
Argument Mining, Intelligent Education
Zhe Chen
School of Computer Science and Technology, East China Normal University
Chenghao Jia
School of Computer Science and Technology, East China Normal University
Fang Zhe
School of Computer Science and Technology, East China Normal University
Zhaoguang Long
School of Computer Science and Technology, East China Normal University
Shu Liu
School of Computer Science and Technology, East China Normal University; Lab of Artificial Intelligence for Education, East China Normal University; Shanghai Institute of Artificial Intelligence for Education, East China Normal University
Man Lan
School of Computer Science and Technology, East China Normal University
NLP