Skewed Score: A statistical framework to assess autograders

📅 2025-07-04
🤖 AI Summary
Problem: While large language model (LLM)-based automatic scorers offer scalability, their reliability is mixed and their biases are multifaceted, spanning response type, domain, and scoring methodology; this necessitates jointly assessing their reliability and bias. Method: We propose a Bayesian generalized linear model (GLM) framework that jointly models rater characteristics (e.g., LLM architecture) and response attributes (e.g., length, generating model) to simultaneously quantify scoring performance, systematic biases, and uncertainty. Contribution/Results: Unlike conventional reliability metrics alone, this approach enhances interpretability, enables fine-grained bias attribution, and yields posterior probability estimates for bias differences. Empirical evaluation across multi-dimensional scoring tasks demonstrates its effectiveness in bias detection and reliability calibration.

📝 Abstract
The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", or autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, and other factors. In this paper we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to simultaneously assess their autograders while also addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can be used to augment traditional reliability metrics such as inter-rater agreement, by providing uncertainty estimates and clarifying the source of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.
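The modeling idea in the abstract, evaluation outcomes regressed on grader and item properties within a Bayesian GLM, can be sketched in a minimal, self-contained way. The example below is an illustrative assumption, not the paper's actual model or data: it simulates pairwise-preference outcomes, fits a Bernoulli GLM with an interaction term (extra response-length effect for autograders) via a toy Metropolis sampler, and reports the posterior probability that autograders weight length more heavily than humans. All effect sizes, variable names, and the simulated dataset are hypothetical.

```python
import math
import random

random.seed(0)

# Hypothetical data: (is_autograder, standardized response length, preferred).
# The "true" effects below are made up for illustration: autograders are
# simulated to favour longer responses more strongly than human raters.
def simulate(n=400):
    data = []
    for _ in range(n):
        auto = random.random() < 0.5
        length = random.gauss(0, 1)
        logit = 0.2 + (0.8 * length if auto else 0.1 * length)
        p = 1 / (1 + math.exp(-logit))
        data.append((auto, length, random.random() < p))
    return data

def log_post(beta, data):
    # N(0, 1) prior on each coefficient plus the Bernoulli log-likelihood.
    b0, b_len, b_auto_len = beta
    lp = -0.5 * sum(b * b for b in beta)
    for auto, length, y in data:
        logit = b0 + b_len * length + b_auto_len * length * auto
        lp += logit * y - math.log(1 + math.exp(logit))
    return lp

def metropolis(data, steps=3000, scale=0.1):
    # Random-walk Metropolis with a symmetric Gaussian proposal.
    beta = [0.0, 0.0, 0.0]
    cur = log_post(beta, data)
    samples = []
    for _ in range(steps):
        prop = [b + random.gauss(0, scale) for b in beta]
        lp = log_post(prop, data)
        if math.log(random.random()) < lp - cur:
            beta, cur = prop, lp
        samples.append(beta[2])  # interaction: extra length effect for autograders
    return samples[steps // 2:]  # discard the first half as burn-in

draws = metropolis(simulate())
p_bias = sum(d > 0 for d in draws) / len(draws)
print(f"P(autograders weight length more than humans) = {p_bias:.2f}")
```

This is the kind of output the abstract alludes to: instead of a single agreement statistic, the posterior over the interaction coefficient directly quantifies a specific bias (here, length) with uncertainty attached. A real analysis would use a proper probabilistic-programming stack rather than a hand-rolled sampler.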
Problem

Research questions and friction points this paper is trying to address.

Assess the reliability and biases of LLM-based autograders
Quantify scoring differences between human raters and autograders
Improve the interpretability of autograder performance and bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian GLMs for autograder assessment
Quantifies scoring differences and biases
Augments reliability metrics with uncertainty