Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the limited statistical analysis of how modifications to scoring rubrics affect agreement between human raters and large language model (LLM)-based autoraters. It quantifies the effect of several rubric design choices, including representative examples and additional context, rubric complexity, and reduced positional bias, on human-autorater score agreement. Drawing on data from automated essay scoring and instruction-following evaluation, the authors examine how agreement changes as holistic or analytic judgments are elicited. Rubric edits that provide representative examples and additional context, or that reduce positional bias, increased agreement, whereas higher rubric complexity and conservative aggregation methods tended to decrease it. The results suggest that practitioners should analyze domain- and rubric-specific performance when designing rubrics for autoraters.
📝 Abstract
Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation. However, there is limited statistical analysis of how modifications in a rubric presented to both humans and autoraters affect their score agreement. Rubrics that ask for an overall or holistic judgment - for example, rating the "quality" of an essay - may be inconsistently interpreted due to the complexity or subjectivity of the criteria. Conversely, rubrics can ask for analytic judgments, which decompose assessment criteria - for example, "quality" into "fluency" and "organization". While these rubrics can be edited to improve the individual accuracy of both human and automated scoring, this approach may result in disagreement between the two scores, or with the associated holistic judgment. Designing and deploying reliable autoraters requires understanding not just the relationship between human and autorater annotations but how that relationship changes as holistic or analytic judgments are elicited. The results indicate that rubric edits providing representative examples and additional context, and reducing positional bias in the rubric, increased human-autorater agreement, while higher rubric complexity and conservative aggregation methods tended to decrease it. The findings from the automatic essay scoring and instruction-following evaluation domains suggest that practitioners should carefully analyze domain- and rubric-specific performance to move towards higher human-autorater agreement.
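
To make the idea of agreement under different aggregation methods concrete, here is a minimal sketch. The abstract does not name the agreement statistic or the aggregation rules used, so everything here is an assumption for illustration: unweighted Cohen's kappa as the agreement measure, the median as a neutral aggregation of analytic scores, and the minimum as a "conservative" one. The scores and the three analytic criteria are made-up toy values, not data or results from the paper.

```python
from collections import Counter
from statistics import median


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    # Fraction of items on which the two raters give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy data (NOT from the paper): holistic human scores on a
# 1-4 scale, and autorater scores on three assumed analytic criteria.
human_holistic = [3, 2, 4, 1, 3, 2]
auto_analytic = [
    (3, 4, 3),
    (2, 2, 1),
    (4, 3, 4),
    (1, 2, 1),
    (3, 3, 2),
    (2, 1, 2),
]

# Two assumed ways to collapse analytic scores into one overall score.
auto_median = [median(s) for s in auto_analytic]  # neutral aggregation
auto_min = [min(s) for s in auto_analytic]        # conservative aggregation

kappa_median = cohen_kappa(human_holistic, auto_median)
kappa_min = cohen_kappa(human_holistic, auto_min)

print(f"kappa (median aggregation): {kappa_median:.3f}")
print(f"kappa (min aggregation):    {kappa_min:.3f}")
```

For ordinal scores such as essay grades, a weighted kappa or a rank correlation may be a better fit than the unweighted form used here. The sketch only shows the mechanics of comparing one human score series against differently aggregated autorater scores.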
Problem

Research questions and friction points this paper is trying to address.

rubric modification
human-autorater agreement
holistic judgment
analytic judgment
LLM-as-judges
Innovation

Methods, ideas, or system contributions that make the work stand out.

rubric design
human-autorater agreement
LLM-as-judges
analytic vs holistic judgment
automatic scoring