Learnable Assessment Skills for LLM-based Automated Scoring: Rubric Construction via Iterative Optimization

2026-05-27
Citations: 0
Influential: 0
AI Summary
This work addresses a limitation of existing large language model (LLM)-based automated scoring systems: they depend on per-item, manually crafted rubrics that are costly to develop and hard to reuse on new tasks. The authors introduce "learnable assessment skills," which formalize scoring competence as item-independent procedural knowledge expressed in natural language. Taking rubric construction as the first instantiation, they propose an iterative optimization framework that splits a skill into a fixed scaffold and learnable item-agnostic rules, and refines the rules through LLM-driven diagnosis of scoring errors and validation-gated selection. The framework requires no expert-written rubric. On all ten items of the ASAP-SAS dataset, the optimized skills substantially improve LLM-based scoring and frequently surpass the dataset-provided expert rubric. Cross-item transfer experiments indicate that the learned skills capture both generalizable and item-specific patterns.
Abstract
LLM-based automated scoring approaches near-human performance, but scaling to new tasks remains bottlenecked by the per-item human configuration of upstream stages such as rubric construction. Human experts bypass this bottleneck through evaluation heuristics developed over extensive practice. We ask whether LLMs can learn similar heuristics directly from scoring experience, and formalize this as the concept of assessment skills: item-independent natural-language procedural knowledge that guides LLMs through specific stages of the scoring workflow. Focusing on rubric construction as a first instantiation, we propose an iterative framework that decomposes a skill into a fixed scaffold and learnable item-agnostic rules, refining the rules through LLM-driven diagnosis of scoring errors and validation-gated selection. The framework requires no expert-written rubric. On all ten ASAP-SAS items, optimized skills substantially improve LLM-based scoring and frequently surpass the dataset-provided expert rubric. Cross-item transfer experiments further reveal that learned skills capture both generalizable and item-specific patterns.
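The refinement loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `optimize_rules`, `score_fn`, `diagnose_fn`, and `agreement` are hypothetical names, the LLM calls are passed in as plain callables, rubric generation is folded into `score_fn`, and exact-match agreement stands in for whatever metric the paper actually uses.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]          # (student response, gold score)
Error = Tuple[str, int, int]       # (student response, predicted, gold)


def agreement(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Exact-match agreement; a stand-in metric chosen for simplicity."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def optimize_rules(
    scaffold: str,
    rules: List[str],
    train: Sequence[Example],
    val: Sequence[Example],
    score_fn: Callable[[str, List[str], str], int],
    diagnose_fn: Callable[[List[str], List[Error]], List[str]],
    n_iters: int = 5,
) -> Tuple[List[str], float]:
    """Refine the learnable rules while the scaffold stays fixed.

    score_fn (hypothetical) scores one response given scaffold + rules.
    diagnose_fn (hypothetical) proposes revised rules from scoring errors.
    """

    def evaluate(candidate: List[str], data: Sequence[Example]) -> float:
        preds = [score_fn(scaffold, candidate, text) for text, _ in data]
        return agreement(preds, [gold for _, gold in data])

    best_rules, best_val = list(rules), evaluate(rules, val)
    for _ in range(n_iters):
        # Collect scoring errors made with the current best rules.
        errors: List[Error] = []
        for text, gold in train:
            pred = score_fn(scaffold, best_rules, text)
            if pred != gold:
                errors.append((text, pred, gold))
        if not errors:
            break
        # Diagnosis step: propose a revised rule set from the errors.
        candidate = diagnose_fn(best_rules, errors)
        # Validation gate: keep the candidate only if held-out agreement improves.
        cand_val = evaluate(candidate, val)
        if cand_val > best_val:
            best_rules, best_val = candidate, cand_val
    return best_rules, best_val
```

In practice `score_fn` and `diagnose_fn` would wrap LLM prompts. Passing them as arguments keeps the gating logic testable with simple stubs, and the strict `>` comparison means a candidate that merely ties on the validation set is rejected.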
Problem

Research questions and friction points this paper is trying to address.

automated scoring
rubric construction
large language models
assessment skills
scaling bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

learnable assessment skills
rubric construction
iterative optimization
LLM-based automated scoring
item-agnostic rules