Aligning Black-box Language Models with Human Judgments

📅 2025-02-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of aligning automatic evaluations from black-box large language models (LLMs) with human judgments. Such alignment is critical for human-centered applications such as recommendation and search, and it is complicated by individual variability and bias among human evaluators. The paper proposes a fine-tuning-free linear calibration framework that aligns LLM judgments with individual annotators or with their aggregated judgments, in both zero-shot and few-shot settings. The method learns a linear mapping from the LLM's outputs to human judgments using a small number of calibration examples. Across 29 tasks, it achieves an average improvement of over 142% in human–LLM agreement, and on four of six tasks the calibrated LLM exceeds inter-human agreement. After calibration, smaller models also reach evaluation performance comparable to that of larger models, which may reduce inference cost.

📝 Abstract

Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and hard to scale. LLMs offer an efficient solution for continuous, automated evaluation. However, since the systems that are built and improved with these judgments are ultimately designed for human use, it is crucial that LLM judgments align closely with those of human evaluators so that such systems remain human-centered. At the same time, aligning LLM judgments with human evaluators is challenging because human judgments vary between individuals and carry individual biases. We propose a simple yet effective framework to align LLM judgments with individual human evaluators or their aggregated judgments, without retraining or fine-tuning the LLM. Our approach learns a linear mapping between the LLM's outputs and human judgments, achieving an average improvement in agreement of over 142% across 29 tasks while using only a small number of calibration examples for training. Notably, our method works in zero-shot and few-shot settings, exceeds inter-human agreement on four out of six tasks, and enables smaller LLMs to achieve performance comparable to that of larger models.
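
The idea of learning a linear mapping from LLM outputs to human judgments can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the input features, the loss, or the agreement metric, so the sketch assumes a single scalar LLM score per item, an ordinary least-squares fit, and a 1 to 5 human rating scale. The function names fit_linear_calibration and apply_calibration and the example numbers are hypothetical.

```python
import numpy as np


def fit_linear_calibration(llm_scores, human_scores):
    """Fit human ~ slope * llm + intercept by ordinary least squares.

    Assumption: one scalar LLM score per item; the paper may use richer outputs.
    """
    x = np.asarray(llm_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    # Design matrix with a column of ones so the fit includes an intercept.
    design = np.column_stack([x, np.ones_like(x)])
    (slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(slope), float(intercept)


def apply_calibration(llm_scores, slope, intercept, lo=1.0, hi=5.0):
    """Map raw LLM scores onto the human scale, clipped to [lo, hi].

    Assumption: human ratings live on a bounded scale (here 1 to 5).
    """
    x = np.asarray(llm_scores, dtype=float)
    return np.clip(slope * x + intercept, lo, hi)


if __name__ == "__main__":
    # Hypothetical few-shot calibration set: the LLM rates on a 1-10 scale,
    # while one annotator (or the aggregated panel) rates on a 1-5 scale.
    llm_calibration = [2, 4, 6, 8, 10]
    human_calibration = [1, 2, 3, 4, 5]

    slope, intercept = fit_linear_calibration(llm_calibration, human_calibration)

    # Calibrate the LLM's scores on new, unlabeled items.
    print(apply_calibration([3, 7, 9], slope, intercept))
```

Fitting one mapping per annotator would target an individual evaluator, while fitting against averaged ratings would target the aggregated judgment. The LLM itself is only queried, never retrained, which is why the approach suits black-box models.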

Problem

Research questions and friction points this paper is trying to address.

Align LLM judgments with human evaluators

Improve agreement without retraining LLMs

Enable smaller LLMs to match larger models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear mapping for alignment

Zero-shot and few-shot settings

Improves small LLM performance

🔎 Similar Papers

LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks