Confidence Diagram of Nonparametric Ranking for Uncertainty Assessment in Large Language Models Evaluation

📅 2024-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two key challenges in large language model (LLM) evaluation: high uncertainty in model rankings and the difficulty of reliably characterizing domain-specific capabilities. To this end, we propose a nonparametric contextual ranking framework. Methodologically, we introduce the novel concept of a “confidence diagram”, a Hasse diagram that encodes all statistically significant, confidence-supported orderings in a single directed graph; extend the Gaussian multiplier bootstrap to suprema of independent but not identically distributed empirical processes, enabling rigorous inference for rankings; and integrate context-aware scoring, compositional reasoning, and multiple hypothesis testing. Experiments on synthetic and real-world medical datasets demonstrate that our framework substantially improves the reliability and interpretability of LLM rankings for domain-specific competencies, particularly clinical reasoning, and provides the first theoretically grounded uncertainty quantification tool for best-of-$N$ alignment strategies.
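The confidence-diagram idea above can be sketched in code. This is an illustrative Python sketch, not the paper's implementation: it assumes a simple rule in which one model dominates another when their confidence intervals are significantly separated, then performs the Hasse reduction by dropping transitively implied edges. The function name, the interval-separation criterion, and the data layout are all assumptions for illustration.

```python
def confidence_diagram(means, half_widths):
    """Edge set of a Hasse diagram over models 0..k-1 (illustrative sketch).

    Model a is taken to dominate model b when a's lower confidence bound
    exceeds b's upper confidence bound; transitively implied edges are then
    removed, which is the Hasse reduction.
    """
    k = len(means)
    # All statistically separated ordered pairs (a dominates b).
    dominates = {
        (a, b)
        for a in range(k) for b in range(k)
        if a != b and means[a] - half_widths[a] > means[b] + half_widths[b]
    }
    # Hasse reduction: drop (a, b) if it is implied via some intermediate c.
    edges = {
        (a, b) for (a, b) in dominates
        if not any((a, c) in dominates and (c, b) in dominates for c in range(k))
    }
    return sorted(edges)
```

With clearly separated intervals such as means `[3, 2, 1]` and half-widths `[0.1, 0.1, 0.1]`, only the covering edges `(0, 1)` and `(1, 2)` survive the reduction; overlapping intervals yield no edges, reflecting ranking uncertainty.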

📝 Abstract
We consider inference for the ranking of large language models (LLMs). Alignment is a significant challenge in mitigating hallucinations when deploying LLMs, and ranking LLMs has proven to be an effective tool for improving alignment under the best-of-$N$ policy. In this paper, we propose a new inferential framework for hypothesis testing on rankings of language models. Our framework builds on a nonparametric contextual ranking approach designed to assess LLMs' domain-specific expertise, leveraging nonparametric scoring methods that account for their sensitivity to prompts. To characterize the combinatorial complexity of the ranking, we introduce the novel concept of a confidence diagram, which leverages a Hasse diagram to represent the entire confidence set of rankings as a single directed graph. We establish the validity of the proposed confidence diagram by advancing Gaussian multiplier bootstrap theory to accommodate the supremum of independent empirical processes that are not necessarily identically distributed. Extensive numerical experiments on both synthetic and real data demonstrate that our approach offers valuable insight into evaluating the performance of different LLMs across various medical domains.
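The Gaussian multiplier bootstrap step mentioned in the abstract can be illustrated with a minimal sketch: standard-normal multipliers perturb centered per-prompt scores, and the supremum of the resulting statistics over models gives a critical value for simultaneous confidence bounds. The data layout (`scores[i, m]` = score of model `m` on prompt `i`), the function name, and the absolute-supremum statistic are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def multiplier_bootstrap_quantile(scores, n_boot=2000, alpha=0.05):
    """Bootstrap (1 - alpha)-quantile of the supremum statistic (sketch).

    scores: array of shape (n_prompts, n_models); each column is one model's
    per-prompt scores. Columns are centered, multiplied by i.i.d. Gaussian
    weights, and the max absolute normalized sum over models is recorded.
    """
    n, _ = scores.shape
    centered = scores - scores.mean(axis=0)          # center each model's scores
    sup_stats = np.empty(n_boot)
    for b in range(n_boot):
        g = rng.standard_normal(n)                   # Gaussian multipliers
        boot = (g[:, None] * centered).sum(axis=0) / np.sqrt(n)
        sup_stats[b] = np.abs(boot).max()            # supremum over models
    return np.quantile(sup_stats, 1 - alpha)         # bootstrap critical value
```

The returned quantile would serve as a simultaneous critical value: pairwise score gaps exceeding it (suitably scaled) are the statistically significant orderings that populate the confidence diagram.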
Problem

Research questions and friction points this paper is trying to address.

Inference for ranking large language models
Mitigate hallucinations via ranking-based best-of-$N$ alignment
Evaluate LLMs' domain-specific expertise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Nonparametric contextual ranking framework
Confidence diagram using Hasse diagram
Gaussian multiplier bootstrap theory
Zebin Wang
Department of Biostatistics, Harvard Chan School of Public Health
Yi Han
Department of Statistics, Columbia University
Ethan X. Fang
Associate Professor at Duke University
Statistics · Biostatistics · Optimization
Lan Wang
Department of Management Science, University of Miami
Junwei Lu
Department of Biostatistics, Harvard Chan School of Public Health