🤖 AI Summary
Existing evaluation of Chinese essay generation relies predominantly on coarse-grained metrics, overlooking genre-specific conventions and structural-rhetorical complexity. To address this, we introduce the first fine-grained Chinese essay evaluation benchmark covering four major genres—argumentative, narrative, descriptive, and expository—comprising 728 authentic prompts spanning open-ended and constrained scenarios. We propose a genre-specific hierarchical scoring framework, manually annotating and modeling essays along three dimensions: structure, rhetoric, and content. We further introduce a human-AI collaborative verification mechanism and a high-consistency human evaluation protocol. Systematic evaluation of 15 large language models reveals critical insights into their genre adaptability and instruction-following boundaries. The benchmark is fully reproducible and establishes an empirically grounded paradigm for Chinese essay generation and assessment in educational contexts.
📝 Abstract
Chinese essay writing and its evaluation are critical in educational contexts, yet the capabilities of Large Language Models (LLMs) in this domain remain largely underexplored. Existing benchmarks often rely on coarse-grained text quality metrics, largely overlooking the structural and rhetorical complexity of Chinese essays, particularly across diverse genres. To address this gap, we propose a multi-genre benchmark specifically designed for Chinese essay writing across four major genres: Argumentative, Narrative, Descriptive, and Expository. We curate and refine a total of 728 real-world prompts to ensure authenticity and meticulously categorize them into Open-Ended and Constrained sets to capture diverse writing scenarios. To reliably evaluate generated essays, we develop a fine-grained, genre-specific scoring framework that hierarchically aggregates scores along three dimensions: structure, rhetoric, and content. We further validate our evaluation protocol through a comprehensive human agreement study. Finally, we benchmark 15 LLMs, analyzing their strengths and limitations across genres and instruction types. With this benchmark, we aim to advance LLM-based Chinese essay evaluation and inspire future research on improving essay generation in educational settings.
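To make the idea of hierarchical score aggregation concrete, here is a minimal Python sketch: sub-criterion scores are averaged into dimension scores (structure, rhetoric, content), which are then combined with genre-specific weights into an overall essay score. All criterion names and weight values below are illustrative assumptions, not the benchmark's actual rubric.

```python
# Hypothetical sketch of genre-specific hierarchical scoring.
# Sub-criterion scores -> dimension means -> genre-weighted overall score.
# Weights and criterion names are illustrative, not the paper's rubric.
from statistics import mean

# Assumed genre-specific weights over the three dimensions (must sum to 1).
GENRE_WEIGHTS = {
    "argumentative": {"structure": 0.4, "rhetoric": 0.2, "content": 0.4},
    "narrative":     {"structure": 0.3, "rhetoric": 0.4, "content": 0.3},
}

def aggregate(genre: str, scores: dict[str, list[float]]) -> float:
    """Average sub-criterion scores within each dimension, then take a
    genre-weighted sum of the dimension scores."""
    weights = GENRE_WEIGHTS[genre]
    dim_scores = {dim: mean(vals) for dim, vals in scores.items()}
    return sum(w * dim_scores[dim] for dim, w in weights.items())

overall = aggregate("narrative", {
    "structure": [4.0, 3.0],  # e.g., coherence, paragraphing
    "rhetoric":  [5.0],       # e.g., figurative language
    "content":   [4.0, 4.0],  # e.g., relevance, depth
})
print(round(overall, 2))  # 0.3*3.5 + 0.4*5.0 + 0.3*4.0 = 4.25
```

The two-level design lets each genre emphasize different dimensions (e.g., rhetoric weighing more heavily for narrative essays) while keeping the per-criterion annotation format uniform.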