Comparison of Scoring Rationales Between Large Language Models and Human Raters

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates differences between the rationales generated by large language models (LLMs) and those written by human raters in essay scoring, aiming to uncover the cognitive and inferential roots of scoring inconsistency. Method: Using state-of-the-art LLMs, including GPT-4o and Gemini, we conduct the first semantic-level clustering and explainability comparison of human and LLM-generated scoring rationales, combining several analyses: cosine similarity, principal component analysis, quadratic weighted kappa, and normalized mutual information. Contribution/Results: We identify notable disparities between humans and LLMs in reasoning pathways, evidence citation, and evaluation dimensions, yet certain models achieve human-level accuracy and internal consistency within their rationales. This work provides empirical grounding and a methodological framework for understanding the reasoning mechanisms underlying automated essay scoring, improving model interpretability, and advancing human-AI collaborative assessment.
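As a quick illustration of the two agreement metrics named above, the sketch below computes quadratic weighted kappa and normalized mutual information with scikit-learn. The score arrays are hypothetical stand-ins, not data from the paper.

```python
# Minimal sketch of the agreement metrics used to compare raters.
from sklearn.metrics import cohen_kappa_score, normalized_mutual_info_score

human_scores = [3, 4, 2, 5, 3, 4, 1, 3]  # hypothetical human-rater scores
llm_scores   = [3, 4, 3, 5, 2, 4, 1, 3]  # hypothetical LLM-assigned scores

# Quadratic weighted kappa: chance-corrected agreement that penalizes
# large score disagreements more heavily than adjacent-score ones.
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")

# Normalized mutual information: how much knowing one rater's score
# reduces uncertainty about the other's, scaled to [0, 1].
nmi = normalized_mutual_info_score(human_scores, llm_scores)

print(f"QWK: {qwk:.3f}  NMI: {nmi:.3f}")
```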

📝 Abstract
Advances in automated scoring are closely aligned with advances in machine-learning and natural-language-processing techniques. With recent progress in large language models (LLMs), the use of ChatGPT, Gemini, Claude, and other generative-AI chatbots for automated scoring has been explored. Given their strong reasoning capabilities, LLMs can also produce rationales to support the scores they assign. Thus, evaluating the rationales provided by both human and LLM raters can help improve the understanding of the reasoning that each type of rater applies when assigning a score. This study investigates the rationales of human and LLM raters to identify potential causes of scoring inconsistency. Using essays from a large-scale test, the scoring accuracy of GPT-4o, Gemini, and other LLMs is examined based on quadratic weighted kappa and normalized mutual information. Cosine similarity is used to evaluate the similarity of the rationales provided. In addition, clustering patterns in rationales are explored using principal component analysis based on the embeddings of the rationales. The findings of this study provide insights into the accuracy and "thinking" of LLMs in automated scoring, helping to improve the understanding of the rationales behind both human scoring and LLM-based automated scoring.
Problem

Research questions and friction points this paper is trying to address.

Compares scoring rationales between LLMs and human raters
Investigates causes of scoring inconsistencies in automated evaluation
Analyzes rationale similarity and clustering patterns using embedding techniques (see the embedding sketch after this list)
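A minimal sketch of the embedding-based rationale comparison, assuming a sentence-transformer encoder; the model name (`all-MiniLM-L6-v2`) and the two rationales are illustrative choices, not the paper's stated configuration.

```python
# Embed each rationale, then compare with cosine similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

human_rationale = "The essay presents a clear thesis but cites little evidence."
llm_rationale = "A clear central claim, though supporting evidence is thin."

# encode() returns one embedding vector per input text.
embeddings = model.encode([human_rationale, llm_rationale])
sim = cosine_similarity(embeddings[:1], embeddings[1:])[0, 0]
print(f"Rationale cosine similarity: {sim:.3f}")
```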
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating human and LLM rationales using cosine similarity
Analyzing rationale clustering patterns via principal component analysis (see the PCA sketch after this list)
Comparing scoring accuracy with quadratic weighted kappa metrics
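A minimal sketch of the PCA step, projecting rationale embeddings to two components and comparing group centroids. The random embeddings and group sizes are placeholders for encoded human and LLM rationales.

```python
# Project rationale embeddings into 2D and inspect rater-group separation.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in embeddings: in practice these come from encoding each rationale.
embeddings = rng.normal(size=(40, 384))
rater = np.array(["human"] * 20 + ["llm"] * 20)

pca = PCA(n_components=2)
coords = pca.fit_transform(embeddings)

# Well-separated centroids in PC space suggest systematically different
# rationales between the two rater types.
for group in ("human", "llm"):
    centroid = coords[rater == group].mean(axis=0)
    print(f"{group} centroid in PC space: {centroid.round(3)}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_.round(3)}")
```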