Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses evaluation bias in large language models (LLMs) for Tang poetry generation. We propose a three-stage, multidimensional assessment framework integrating computational metrics, LLM-as-a-judge evaluations, and human expert reviews—systematically analyzing thematic coherence, affective resonance, imagistic richness, metrical fidelity, and stylistic authenticity. Our findings reveal a pervasive “echo chamber” effect: LLMs consistently overestimate their own performance, particularly regarding cultural depth and aesthetic creativity, diverging significantly from human scholarly judgment. To mitigate this, we introduce a novel human–AI hybrid verification mechanism, grounded in the position that technical metrics must be anchored to humanistic standards. Experiments across multiple state-of-the-art LLMs demonstrate that relying solely on model self-evaluation misleads poetry quality assessment. The results underscore the necessity of interdisciplinary, collaborative evaluation paradigms for culturally grounded generative tasks.

📝 Abstract
Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including theme, emotion, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit "echo chamber" effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and the limitations of current LLMs as proxies for literary generation and evaluation, demonstrating the continued need for hybrid human–model validation in culturally and technically complex creative tasks.
Problem

Research questions and friction points this paper is trying to address.

Evaluating the quality of classical Chinese poetry generated by LLMs
Identifying systematic biases in LLM poetry assessment methods
Assessing divergence between AI and human creative judgments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Three-step framework combining computational metrics, LLM-as-a-judge assessment, and human expert validation
Assesses poetic quality across themes, emotions, and style
Reveals LLM echo chamber effects in creative evaluation
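The echo-chamber finding above boils down to comparing per-dimension scores from an LLM judge against human expert scores and flagging where the model overrates itself. A minimal sketch of that comparison step is below; the dimension names mirror the paper's five quality dimensions, but all scores, the 1–5 scale, and the divergence threshold are illustrative assumptions, not the paper's actual protocol.

```python
# Sketch of the divergence check at the heart of the paper's finding:
# per-dimension LLM-judge scores vs. human expert scores, flagging
# dimensions where the LLM overrates ("echo chamber" candidates).
# Scores, scale (1-5), and threshold are hypothetical.

DIMENSIONS = ["theme", "emotion", "imagery", "form", "style"]

def divergence_report(llm_scores, human_scores, threshold=1.0):
    """Return {dimension: gap} where the LLM judge exceeds humans by > threshold."""
    flagged = {}
    for dim in DIMENSIONS:
        gap = llm_scores[dim] - human_scores[dim]
        if gap > threshold:  # LLM assessment notably above human judgment
            flagged[dim] = gap
    return flagged

# Made-up example: the LLM judge rates generated poems higher than experts do
llm = {"theme": 4.5, "emotion": 4.2, "imagery": 4.8, "form": 3.9, "style": 4.6}
human = {"theme": 4.0, "emotion": 3.0, "imagery": 3.2, "form": 3.8, "style": 3.1}
print(divergence_report(llm, human))
```

In a hybrid verification setup like the one the paper argues for, flagged dimensions would be routed to human reviewers rather than accepted from the model's self-assessment.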