AI Summary
Existing benchmarks for evaluating large language models' (LLMs) understanding and usage of Chinese chengyu (idioms) suffer from narrow task coverage and limited scope. To address this, we propose Chengyu-Bench, the first comprehensive evaluation benchmark dedicated to chengyu comprehension and application. It comprises three tasks: sentiment polarity classification, contextual appropriateness detection, and open-ended cloze generation, covering 1,765 high-frequency chengyu. Our contributions include: (1) moving beyond conventional multiple-choice formats by incorporating context error identification and generative cloze completion, enabling multi-dimensional, application-oriented assessment; (2) curating high-quality, diverse, human-verified corpora; and (3) adopting a hybrid evaluation paradigm integrating classification, detection, and generation. Experimental results reveal that state-of-the-art LLMs achieve strong performance on sentiment classification (>95%), but exhibit substantial deficiencies in contextual appropriateness detection (~85%) and open cloze (top-1 accuracy ~40%), highlighting persistent gaps in deep semantic and cultural understanding of chengyu.
Abstract
Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks: multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness and ~40% top-1 accuracy on Open Cloze. Error analysis reveals that most mistakes arise from fundamental misunderstandings of idiom meanings. Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage. The benchmark and source code are available at: https://github.com/sofyc/ChengyuBench.
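The three task types reduce to two scoring schemes: label accuracy for the classification tasks and exact-match top-1 accuracy for the generative cloze. The sketch below illustrates that scoring logic; the function names, label strings, and example items are assumptions for illustration, not the benchmark's actual schema or data.

```python
# Hypothetical scoring sketch for the three Chengyu-Bench task types.
# Field names, labels, and example idioms are illustrative assumptions.

def score_classification(preds, golds):
    """Accuracy for label tasks (Evaluative Connotation, Appropriateness)."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

def score_open_cloze_top1(pred_idioms, gold_idioms):
    """Top-1 accuracy for Open Cloze: the generated idiom must match exactly."""
    correct = sum(p == g for p, g in zip(pred_idioms, gold_idioms))
    return correct / len(gold_idioms)

# Toy example: 2 of 3 connotation labels correct, 1 of 2 cloze fills correct.
conn_acc = score_classification(["pos", "neg", "pos"], ["pos", "neg", "neg"])
cloze_acc = score_open_cloze_top1(["一箭双雕", "画蛇添足"], ["一石二鸟", "画蛇添足"])
print(round(conn_acc, 3), cloze_acc)
```

Exact string match is a strict criterion for Open Cloze; since a blank can sometimes be filled by several near-synonymous idioms, reported top-1 accuracy is a lower bound on semantically acceptable completions.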