Nonparametric LLM Evaluation from Preference Data

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes DMLEval, a nonparametric statistical framework for robust and efficient evaluation of large language models (LLMs) based on debiased machine learning (DML). Addressing the limitations of existing methods, which often rely on strong parametric assumptions or lack valid uncertainty quantification, DMLEval introduces generalized average ranking scores (GARS), which unify ranking models such as Bradley-Terry and PageRank/rank centrality without parametric constraints. The framework accommodates complex preference structures (including ties), is compatible with black-box machine learning methods and LLM-as-a-judge evaluators, and provides guidance for optimal preference data collection under budget constraints. Experiments on synthetic and real-world datasets show that DMLEval significantly outperforms current approaches in ranking accuracy, statistical efficiency, and data utilization, offering a flexible and reliable tool for LLM assessment.
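For intuition, here is a minimal sketch of the debiasing idea behind combining an LLM judge with a smaller set of human labels. It assumes judge scores encode P(A preferred) with ties counted as half, and uses the generic two-term recipe (cheap prediction plus a bias correction on the human-labeled subset, as in prediction-powered/doubly robust estimation); the function name is illustrative and this is not the paper's actual estimator, which is more general (cross-fitting, arbitrary GARS targets):

```python
import numpy as np

def debiased_win_rate(judge_on_labeled, human_labels, judge_on_unlabeled):
    """Debiased estimate of P(model A preferred over model B).

    judge_on_labeled:   LLM-judge scores on prompts that also have human labels
    human_labels:       human outcomes on those same prompts
                        (1.0 = A wins, 0.5 = tie, 0.0 = B wins)
    judge_on_unlabeled: LLM-judge scores on prompts without human labels
    """
    judge_on_labeled = np.asarray(judge_on_labeled, dtype=float)
    human_labels = np.asarray(human_labels, dtype=float)
    judge_on_unlabeled = np.asarray(judge_on_unlabeled, dtype=float)

    plug_in = judge_on_unlabeled.mean()          # cheap, but biased if the judge is
    residuals = human_labels - judge_on_labeled  # judge bias on the labeled subset
    estimate = plug_in + residuals.mean()        # may exit [0, 1] slightly; clip if needed

    # Standard error from the two independent sample means.
    se = np.sqrt(
        judge_on_unlabeled.var(ddof=1) / len(judge_on_unlabeled)
        + residuals.var(ddof=1) / len(human_labels)
    )
    return estimate, se
```

The correction term is what restores valid uncertainty quantification: any systematic judge bias cancels in expectation, and the extra variance it introduces shows up honestly in the standard error.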

📝 Abstract
Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid uncertainty quantification when flexible machine learning methods are used. In this paper, we propose a nonparametric statistical framework, DMLEval, for comparing and ranking LLMs from preference data using debiased machine learning (DML). For this, we introduce generalized average ranking scores (GARS), which generalize commonly used ranking models, including the Bradley-Terry model and PageRank/rank centrality, to complex human responses such as ties. DMLEval comes with the following advantages: (i) it produces statistically efficient estimates of GARS ranking scores; (ii) it naturally allows the incorporation of black-box machine learning methods for estimation; (iii) it can be combined with pre-trained LLM evaluators (e.g., using LLM-as-a-judge); and (iv) it suggests optimal policies for collecting preference data under budget constraints. We demonstrate these advantages both theoretically and empirically on synthetic and real-world preference datasets. In summary, our framework provides practitioners with powerful, state-of-the-art methods for comparing and ranking LLMs.
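One simple member of the GARS family is a model's average preference probability against the other models, with ties counted as half a win: a quantity that is well defined without Bradley-Terry-style parametric assumptions (under the Bradley-Terry model it reduces to a monotone function of the model's strength). The sketch below implements that simplified reading; the function name and encoding are illustrative assumptions, and the paper's definition is more general:

```python
import numpy as np

def average_rank_scores(outcomes):
    """Nonparametric average ranking scores from pairwise preference data.

    outcomes[i][j] is a list of comparison results between models i and j,
    coded 1.0 (i wins), 0.5 (tie), 0.0 (j wins). Model i's score is its
    average estimated win probability over the other models.
    """
    m = len(outcomes)
    p = np.full((m, m), np.nan)
    for i in range(m):
        for j in range(m):
            if i != j and outcomes[i][j]:
                p[i, j] = float(np.mean(outcomes[i][j]))
    # score_i = average of p_ij over opponents j with observed comparisons
    return np.array([np.nanmean(p[i, np.arange(m) != i]) for i in range(m)])

# Three models; outcomes[j][i] mirrors outcomes[i][j] (1 - result).
outcomes = [
    [[],              [1.0, 0.5, 1.0], [1.0, 1.0]],
    [[0.0, 0.5, 0.0], [],              [0.5, 1.0]],
    [[0.0, 0.0],      [0.5, 0.0],      []],
]
print(average_rank_scores(outcomes))  # model 0 ranks highest
```

Because the score is just an average of pairwise preference probabilities, each p_ij can be estimated with flexible black-box methods and then debiased, which is what makes the DML machinery applicable.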
Problem

Research questions and friction points this paper is trying to address.

LLM evaluation
preference data
nonparametric statistics
ranking
uncertainty quantification
Innovation

Methods, ideas, or system contributions that make the work stand out.

nonparametric evaluation
debiased machine learning
LLM ranking
preference data
generalized average ranking scores