🤖 AI Summary
This study investigates whether standard NLP benchmarks can serve as cost-effective proxies for expensive human preference evaluations of dialogue models. Method: a systematic statistical analysis across 160 standard benchmarks (e.g., MMLU, ARC) and more than 13,000 human-annotated dialogues, examining correlations with multi-dimensional human judgments (including helpfulness, safety, and honesty), together with an overparameterized linear regression framework for predicting human preferences across model scales. Contribution/Results: most benchmarks correlate strongly and positively with human preferences; three human evaluations, including adversarial dishonesty and safety, are anticorrelated with benchmark scores, and two are uncorrelated. The regression framework achieves high-fidelity prediction, substituting for over 70% of human evaluations with mean absolute prediction error below 0.08 and Pearson correlation r > 0.72 across both single-turn and multi-turn dialogues. This work establishes a low-cost, high-accuracy paradigm for evaluating conversational AI systems.
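The correlation analysis summarized above can be sketched in a few lines: for each benchmark, compute the Pearson correlation between the per-model benchmark scores and the per-model human preference scores. The helper and all numbers below are illustrative placeholders, not the study's data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two per-model score vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# One score per model (e.g., the four Chat Llama 2 sizes) per metric.
# Placeholder values chosen only to show the direction of each correlation.
benchmark_scores = [45.3, 54.8, 62.1, 68.9]  # e.g., a benchmark accuracy
helpfulness      = [0.31, 0.42, 0.55, 0.63]  # human helpfulness preference
safety_metric    = [0.12, 0.10, 0.09, 0.07]  # an adversarial safety score

print(pearson_r(benchmark_scores, helpfulness))   # strongly positive
print(pearson_r(benchmark_scores, safety_metric)) # negative
```

With only four models, each correlation rests on four points, which is why the study aggregates over 160 benchmarks and many human evaluation dimensions rather than relying on any single pair.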
📝 Abstract
The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming, and noisy human evaluations - yet the relationship between these two evaluation strategies remains hazy. In this paper, we conduct a large-scale study of four Chat Llama 2 models, comparing their performance on 160 standard NLP benchmarks (e.g., MMLU, ARC, BIG-Bench Hard) against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences. However, three human evaluations, including adversarial dishonesty and safety, are anticorrelated with NLP benchmarks, and two are uncorrelated. Moreover, through overparameterized linear regressions, we show that NLP scores can accurately predict human evaluations across different model scales, offering a path to reduce costly human annotation without sacrificing rigor. Overall, our results affirm the continued value of classic benchmarks and illuminate how to harness them to anticipate real-world user satisfaction - pointing to how NLP benchmarks can be leveraged to meet the evaluation needs of our new era of conversational AI.
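The overparameterized-regression idea can be illustrated with a minimal sketch: when there are far more benchmark features (160) than models (four), least squares is underdetermined, and `np.linalg.lstsq` returns the minimum-norm solution that interpolates the training rows exactly. The shapes and synthetic data below are assumptions for illustration, not the authors' exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models, n_benchmarks = 4, 160
X = rng.normal(size=(n_models, n_benchmarks))  # per-model benchmark scores
w_true = rng.normal(size=n_benchmarks)         # hypothetical latent weights
y = X @ w_true                                 # human evaluation scores

# Overparameterized fit: 160 coefficients from 4 observations.
# lstsq with rcond=None returns the minimum-norm least-squares solution.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# The fit reproduces the training models' human scores exactly.
print(np.allclose(X @ w_hat, y))  # True
```

In this regime the interesting question is generalization: whether the minimum-norm fit on some models predicts human evaluations for held-out model scales, which is the property the paper exploits to reduce human annotation.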