AI Summary
This work addresses a limitation of existing large language model (LLM)-based automated scoring systems: they rely on manually crafted rubrics that are costly to develop and difficult to generalize. The authors introduce the concept of "learnable evaluation skills," formalizing scoring competence as procedural knowledge that is expressed in natural language and learned from scoring experience. They propose an iterative optimization framework that combines a fixed scaffold with learnable rules, and that refines those rules through LLM-driven error diagnosis and a validation gate, so that scoring rubrics are constructed automatically without an expert-written rubric. On all ten prompts of the ASAP-SAS dataset, the method substantially improves LLM-based scoring and frequently surpasses the expert-designed rubrics provided with the dataset. Cross-prompt transfer experiments indicate that the learned skills capture both generalizable and prompt-specific patterns.
Abstract
LLM-based automated scoring approaches near-human performance, but scaling to new tasks remains bottlenecked by the per-item human configuration of upstream stages such as rubric construction. Human experts bypass this bottleneck through evaluation heuristics developed over extensive practice. We ask whether LLMs can learn similar heuristics directly from scoring experience, and formalize this as the concept of assessment skills: item-independent natural-language procedural knowledge that guides LLMs through specific stages of the scoring workflow. Focusing on rubric construction as a first instantiation, we propose an iterative framework that decomposes a skill into a fixed scaffold and learnable item-agnostic rules, refining the rules through LLM-driven diagnosis of scoring errors and validation-gated selection. The framework requires no expert-written rubric. On all ten ASAP-SAS items, optimized skills substantially improve LLM-based scoring and frequently surpass the dataset-provided expert rubric. Cross-item transfer experiments further reveal that learned skills capture both generalizable and item-specific patterns.
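The refinement loop described above (a fixed scaffold, learnable rules, diagnosis-driven proposals, and validation-gated selection) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `render_skill`, `optimize_rules`, `propose_rules`, and `validate` are all hypothetical. In a real system, `propose_rules` would wrap an LLM call that diagnoses scoring errors and suggests revised rules, and `validate` would build a rubric from the skill, score held-out responses, and return an agreement measure against human scores. Here both are replaced by toy stand-ins so the control flow can be run as-is.

```python
from typing import Callable, List, Sequence, Tuple


def render_skill(scaffold: str, rules: Sequence[str]) -> str:
    """Combine the fixed scaffold with the current learnable rules."""
    return "\n".join([scaffold] + [f"- {rule}" for rule in rules])


def optimize_rules(
    scaffold: str,
    initial_rules: Sequence[str],
    propose_rules: Callable[[List[str]], List[str]],  # hypothetical: LLM-driven diagnosis
    validate: Callable[[str], float],                 # hypothetical: validation-set agreement
    n_rounds: int = 5,
) -> Tuple[List[str], float]:
    """Refine the rules, keeping a candidate only if validation improves."""
    best_rules = list(initial_rules)
    best_score = validate(render_skill(scaffold, best_rules))
    for _ in range(n_rounds):
        candidate = propose_rules(list(best_rules))
        score = validate(render_skill(scaffold, candidate))
        # Validation gate: discard candidates that do not beat the current best.
        if score > best_score:
            best_rules, best_score = list(candidate), score
    return best_rules, best_score


# Toy stand-ins (assumptions for illustration only, no LLM involved).
_useful = {"check units", "require evidence"}
_new_rules = iter(["check units", "ignore the question", "require evidence"])


def toy_validate(skill_text: str) -> float:
    # Reward useful rules and penalize a harmful one.
    reward = sum(1.0 for rule in _useful if rule in skill_text)
    penalty = 2.0 if "ignore the question" in skill_text else 0.0
    return reward - penalty


def toy_propose(rules: List[str]) -> List[str]:
    # Append the next suggested rule to the current rule set.
    return rules + [next(_new_rules)]


learned_rules, learned_score = optimize_rules(
    "Write a scoring rubric for the item.", [], toy_propose, toy_validate, n_rounds=3
)
```

In this toy run the harmful rule is proposed in the second round and rejected by the gate, so it never enters the learned rule set. The scaffold string is passed through unchanged in every round, which mirrors the split between fixed and learnable parts of a skill.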