Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated scoring systems in large-scale educational assessments suffer from a critical lack of transparency and interpretability. Method: This paper proposes FGTI, four principled interpretability criteria (Faithfulness, Groundedness, Traceability, Interchangeability), as a systematic framework for explainable automated scoring, and introduces AnalyticScore as a baseline reference framework. AnalyticScore leverages large language models to extract semantically explicit, human-identifiable elements from student responses, constructs human-interpretable feature vectors, and applies ordinal logistic regression for interpretable score prediction. Results: Across the 10 items of the ASAP-SAS dataset, AnalyticScore achieves a mean quadratic weighted kappa (QWK) only 0.06 lower than the best uninterpretable model, while its feature extraction exhibits strong agreement with human annotations (Spearman's ρ > 0.92), enhancing the trustworthiness and pedagogical validity of AI-based scoring.
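Quadratic weighted kappa (QWK), the agreement metric reported above, penalizes rater disagreements in proportion to the squared distance between the two scores, so a one-point miss costs far less than a three-point miss. A minimal self-contained sketch of the metric (not from the paper's codebase):

```python
from collections import Counter

def quadratic_weighted_kappa(rater_a, rater_b, min_rating, max_rating):
    """QWK between two lists of integer scores in [min_rating, max_rating]."""
    n = max_rating - min_rating + 1
    total = len(rater_a)

    # Observed score-pair counts
    observed = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    # Marginal histograms for the chance-agreement baseline
    hist_a = Counter(rater_a)
    hist_b = Counter(rater_b)

    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic penalty
            expected = (hist_a.get(i + min_rating, 0)
                        * hist_b.get(j + min_rating, 0)) / total
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den
```

Perfect agreement yields 1.0, chance-level agreement yields 0.0, and systematic disagreement goes negative, which is why a gap of only 0.06 QWK to the uninterpretable state of the art is a meaningful result.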

📝 Abstract
AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholders and develop four principles of interpretability -- Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. AnalyticScore operates by (1) extracting explicitly identifiable elements of the responses, (2) featurizing each response into human-interpretable values using LLMs, and (3) applying an intuitive ordinal logistic regression model for scoring. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
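The final step of the three-step pipeline in the abstract can be illustrated with a small sketch of proportional-odds (ordinal logistic) scoring over human-interpretable features. The feature names, coefficients, and score cutpoints below are hypothetical and chosen for illustration; the paper's actual features and fitted parameters may differ:

```python
import math

# Hypothetical human-interpretable features for one student response
features = {
    "mentions_key_concept": 1.0,    # binary: names the rubric's key concept
    "num_supporting_details": 2.0,  # count of distinct supporting details
    "contradicts_rubric": 0.0,      # binary: contradicts the rubric
}

# Illustrative proportional-odds parameters (not fitted values)
coef = {"mentions_key_concept": 1.8,
        "num_supporting_details": 0.9,
        "contradicts_rubric": -2.5}
cutpoints = [0.5, 2.0, 3.5]  # thresholds between scores 0|1, 1|2, 2|3

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score_probs(x, coef, cutpoints):
    """P(score = k) for each score k under a proportional-odds model."""
    eta = sum(coef[name] * value for name, value in x.items())
    # Cumulative probabilities P(score <= k) at each cutpoint
    cum = [sigmoid(c - eta) for c in cutpoints]
    probs = [cum[0]]
    probs += [cum[k] - cum[k - 1] for k in range(1, len(cum))]
    probs.append(1.0 - cum[-1])
    return probs

probs = score_probs(features, coef, cutpoints)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because each coefficient attaches to a named, human-checkable feature, a score can be traced back to exactly which response elements pushed it up or down, which is the interpretability property the FGTI principles target.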
Problem

Research questions and friction points this paper is trying to address.

Developing interpretable automated scoring for large-scale educational assessments
Addressing transparency needs through Faithfulness, Groundedness, Traceability, and Interchangeability principles
Creating accurate automated scoring that aligns with human evaluation standards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops FGTI principles for interpretable scoring
Uses LLMs to create human-interpretable response features
Applies ordinal logistic regression for transparent scoring
Authors
Yunsung Kim, Stanford University
Mike Hardy, Stanford University
Joseph Tey, Stanford University
Candace Thille, Stanford University
Chris Piech, Assistant Professor, Stanford University
Algorithms for Education · Artificial Intelligence · Learning at Scale