Towards Transparent AI Grading: Semantic Entropy as a Signal for Human-AI Disagreement

๐Ÿ“… 2025-08-06
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿค– AI Summary
Existing AI automated scoring systems, while efficient, fail to adequately characterize scoring uncertainty and inter-annotator disagreement. To address this, we propose *semantic entropy*โ€”a novel uncertainty metric grounded in the reasoning process rather than the final score. Specifically, we prompt GPT-4 to generate multiple justifications for scoring the same short response; these justifications are clustered via entailment-based semantic similarity, and the information entropy of the resulting cluster distribution is computed to quantify explanatory diversity. Crucially, semantic entropy directly links semantic-level reasoning inconsistency with human scoring disagreementโ€”a first in automated assessment. Empirical validation on the ASAP-SAS dataset demonstrates a statistically significant correlation (p < 0.01) between semantic entropy and actual score discrepancies. Moreover, the metric exhibits robustness across disciplines and task types. By grounding uncertainty in interpretable reasoning patterns, semantic entropy enhances the transparency and trustworthiness of AI grading and establishes a new paradigm for human-AI collaborative decision-making in educational assessment.

๐Ÿ“ Abstract
Automated grading systems can efficiently score short-answer responses, yet they often fail to indicate when a grading decision is uncertain or potentially contentious. We introduce semantic entropy, a measure of variability across multiple GPT-4-generated explanations for the same student response, as a proxy for human grader disagreement. By clustering rationales via entailment-based similarity and computing entropy over these clusters, we quantify the diversity of justifications without relying on final output scores. We address three research questions: (1) Does semantic entropy align with human grader disagreement? (2) Does it generalize across academic subjects? (3) Is it sensitive to structural task features such as source dependency? Experiments on the ASAP-SAS dataset show that semantic entropy correlates with rater disagreement, varies meaningfully across subjects, and increases in tasks requiring interpretive reasoning. Our findings position semantic entropy as an interpretable uncertainty signal that supports more transparent and trustworthy AI-assisted grading workflows.
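The pipeline described above can be sketched in a few lines: cluster rationales by bidirectional entailment, then compute Shannon entropy over the cluster sizes. This is a minimal illustration, not the paper's implementation; the `entails` predicate stands in for a real NLI model, and the toy keyword-overlap version below is purely for demonstration.

```python
import math

def cluster_rationales(rationales, entails):
    """Greedy clustering: a rationale joins a cluster if it and the
    cluster's representative mutually entail each other.
    `entails(a, b)` is assumed to wrap an NLI model; here any
    boolean predicate works."""
    clusters = []
    for r in rationales:
        for cluster in clusters:
            rep = cluster[0]
            if entails(r, rep) and entails(rep, r):
                cluster.append(r)
                break
        else:
            clusters.append([r])
    return clusters

def semantic_entropy(clusters):
    """Shannon entropy (bits) of the cluster-size distribution."""
    n = sum(len(c) for c in clusters)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log2(p) for p in probs)

# Toy entailment stub: rationales "entail" each other if they share a word.
entails = lambda a, b: bool(set(a.split()) & set(b.split()))

rationales = ["answer cites evidence", "answer cites source", "off topic"]
clusters = cluster_rationales(rationales, entails)
entropy = semantic_entropy(clusters)  # two clusters of sizes 2 and 1
```

With this toy data, the first two rationales merge into one cluster and the third forms its own, giving an entropy of about 0.918 bits; identical rationales would collapse to a single cluster with entropy 0, signaling a confident grade.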
Problem

Research questions and friction points this paper is trying to address.

Measuring AI grading uncertainty via semantic entropy
Aligning semantic entropy with human grader disagreement
Assessing semantic entropy across subjects and tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic entropy measures GPT-4 rationale variability
Entailment-based clustering quantifies justification diversity
Semantic entropy signals AI grading uncertainty transparently
Karrtik Iyer
Thoughtworks AI Research Labs
Manikandan Ravikiran
Thoughtworks AI Research
Machine Learning, Computer Vision, Natural Language Processing
Prasanna Pendse
Thoughtworks AI Research Labs
Shayan Mohanty
Thoughtworks AI Research Labs