Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use

📅 2025-06-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing benchmarks for evaluating large language models' (LLMs) understanding and usage of Chinese chengyu (idioms) suffer from narrow task coverage and limited scope. To address this, we propose Chengyu-Bench, the first comprehensive evaluation benchmark dedicated to chengyu comprehension and application. It comprises three tasks: sentiment polarity classification, contextual appropriateness detection, and open-ended cloze generation, covering 1,765 high-frequency chengyu. Our contributions include: (1) moving beyond conventional multiple-choice formats by incorporating context error identification and generative cloze completion, enabling multi-dimensional, application-oriented assessment; (2) curating high-quality, diverse, human-verified corpora; and (3) adopting a hybrid evaluation paradigm integrating classification, detection, and generation. Experimental results reveal that state-of-the-art LLMs achieve strong performance on sentiment classification (>95%), but exhibit substantial deficiencies in contextual appropriateness (≈85%) and open cloze (top-1 accuracy ≈40%), highlighting persistent gaps in deep semantic and cultural understanding of chengyu.

πŸ“ Abstract
Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks - multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness and ~40% top-1 accuracy on Open Cloze. Error analysis reveals that most mistakes arise from fundamental misunderstandings of idiom meanings. Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage. The benchmark and source code are available at: https://github.com/sofyc/ChengyuBench.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to understand Chinese idioms' cultural nuances
Assessing idiom usage appropriateness in contextual scenarios
Testing open cloze performance for idiom comprehension without options
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark for Chinese idiom understanding
Three tasks: Connotation, Appropriateness, Open Cloze
Human-verified examples from diverse corpora
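
The three tasks reduce to two metric families reported in the paper: exact-match accuracy for Evaluative Connotation and Appropriateness, and top-1 accuracy for Open Cloze (the model's single best candidate must equal the gold idiom). A minimal scoring sketch is below; the field names (`label`, `prediction`, `candidates`) are illustrative assumptions, not the benchmark's actual data schema.

```python
# Hypothetical sketch of Chengyu-Bench-style scoring.
# Field names ("label", "prediction", "candidates") are assumptions,
# not the benchmark's real schema.

def accuracy(examples):
    """Exact-match accuracy for classification/detection tasks
    (Evaluative Connotation, Appropriateness)."""
    correct = sum(1 for ex in examples if ex["prediction"] == ex["label"])
    return correct / len(examples)

def top1_cloze_accuracy(examples):
    """Open Cloze: only the model's first-ranked candidate counts."""
    correct = sum(
        1 for ex in examples
        if ex["candidates"] and ex["candidates"][0] == ex["label"]
    )
    return correct / len(examples)

if __name__ == "__main__":
    connotation = [
        {"label": "positive", "prediction": "positive"},
        {"label": "negative", "prediction": "positive"},
    ]
    cloze = [
        {"label": "画蛇添足", "candidates": ["画蛇添足", "多此一举"]},
        {"label": "守株待兔", "candidates": ["刻舟求剑"]},
    ]
    print(accuracy(connotation))       # 0.5
    print(top1_cloze_accuracy(cloze))  # 0.5
```

Note that top-1 scoring is stricter than multiple-choice accuracy: with no options supplied, the model must generate the exact idiom, which is where the paper reports the ~40% ceiling for current LLMs.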