🤖 AI Summary
This work addresses the challenge of modeling human rater diversity in automated text evaluation for human–AI interaction. Methodologically, it introduces an interpretable, multi-dimensional, LLM-driven evaluation framework: a manually constructed rubric guides the LLM to produce probabilistic response distributions for each evaluation dimension, and a lightweight calibration network with judge-specific parameters models rater-specific variability, enabling high-fidelity prediction of each judge's fine-grained and holistic scores without additional human annotations. Its key contribution is the first explicit modeling of LLM output distributions as rater-specific functions. Evaluated on a human–AI information-seeking dialogue task with a 9-question rubric, the method predicts user satisfaction (1–4 scale) with root-mean-square error below 0.5, a twofold improvement over the uncalibrated baseline.
📝 Abstract
This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be $\textit{combined}$ to $\textit{predict}$ each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9 questions (assessing dimensions such as naturalness, conciseness, and citation quality) predicts human judges' assessment of overall user satisfaction, on a scale of 1--4, with RMS error $<0.5$, a $2\times$ improvement over the uncalibrated baseline.
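The calibration step described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual architecture: the layer sizes, the use of a per-judge bias vector as the judge-specific parameters, and the tanh activation are all assumptions. It shows the core idea of mapping the LLM's per-question answer distributions, together with a judge identity, to a predicted distribution over that judge's overall 1–4 score.

```python
import numpy as np

rng = np.random.default_rng(0)

N_QUESTIONS, N_OPTIONS = 9, 4   # 9 rubric questions; 4-point answer scale (assumed)
N_JUDGES, HIDDEN = 3, 16        # illustrative sizes, not from the paper

# Judge-independent parameters (shared across all judges)
W1 = rng.normal(scale=0.1, size=(N_QUESTIONS * N_OPTIONS, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_OPTIONS))
b2 = np.zeros(N_OPTIONS)

# Judge-specific parameters: one learned bias vector per judge (hypothetical choice)
judge_bias = np.zeros((N_JUDGES, HIDDEN))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(llm_dists, judge):
    """Map the LLM's per-question answer distributions to a predicted
    distribution over the given judge's overall satisfaction score."""
    x = llm_dists.reshape(-1)                 # concatenate the 9 distributions
    h = np.tanh(x @ W1 + judge_bias[judge])   # judge-specific shift in hidden space
    return softmax(h @ W2 + b2)               # distribution over scores 1..4

# Example: feed uniform LLM answer distributions for every rubric question
llm_dists = np.full((N_QUESTIONS, N_OPTIONS), 1.0 / N_OPTIONS)
p = predict(llm_dists, judge=0)
expected = np.arange(1, 5) @ p                # expected overall score in [1, 4]
```

In training, the parameters would be fit against each judge's actual annotations (e.g., by minimizing cross-entropy over their recorded answers), so the network learns both the shared mapping and each judge's individual tendencies.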