🤖 AI Summary
Evaluating the performance of large language model (LLM) ensembles, particularly discriminators, under scarce labeled data remains challenging. Method: We propose a cost-effective, accurate, and theoretically grounded maximum a posteriori (MAP) estimation framework. It introduces a Beta-Binomial hierarchical model to characterize judgment uncertainty, a conformal prediction-driven adaptive termination mechanism for iterative sampling, and cross-dataset Bayesian prior transfer to improve few-shot generalization. Results: On the TruthfulQA benchmark, our method achieves a discriminator accuracy estimation error of at most 3.37% using only 10 labeled samples, substantially outperforming existing approaches. The framework combines statistical rigor with engineering practicality, enabling reliable LLM evaluation in low-resource settings.
📝 Abstract
LLM ensembles are widely used as LLM judges. However, how to estimate their accuracy, especially efficiently, remains an open question. In this paper, we present a principled maximum a posteriori (MAP) framework for economical and precise estimation of LLM ensemble judgment performance. We first propose a mixture of Beta-Binomial distributions to model the judgment distribution, refining the vanilla Binomial assumption. Next, we introduce a conformal prediction-driven approach that enables adaptive stopping during iterative sampling, balancing accuracy with efficiency. Furthermore, we design a prior transfer mechanism that uses distributions learned on open-source datasets to improve estimation on a target dataset when annotations are scarce. Finally, we present BetaConform, a framework that integrates our distribution assumption, adaptive stopping, and the prior transfer mechanism to deliver a theoretically guaranteed estimate of the LLM ensemble judgment distribution with a minimum of labeled samples. BetaConform is also validated empirically: with only 10 samples from the TruthfulQA dataset, it gauges the performance of a Llama ensemble judge with an error margin as small as 3.37%.
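The two core ideas in the abstract, a Beta prior updated by labeled judgments and a stopping rule for iterative sampling, can be sketched as below. This is a minimal illustration, not the paper's method: the function names are invented, the conjugate Beta-Binomial update is a simplification of the proposed mixture model, and the interval-width stopping proxy stands in for the actual conformal prediction criterion.

```python
import math

def map_accuracy(k_correct, n_labeled, alpha_prior=1.0, beta_prior=1.0):
    """MAP estimate of judge accuracy under a Beta-Binomial model.

    The Beta(alpha_prior, beta_prior) prior could be learned on an
    open-source dataset and transferred to the target dataset, in the
    spirit of the abstract's prior transfer mechanism.
    """
    a = alpha_prior + k_correct               # posterior alpha
    b = beta_prior + (n_labeled - k_correct)  # posterior beta
    if a > 1 and b > 1:
        return (a - 1) / (a + b - 2)          # posterior mode (MAP)
    return a / (a + b)                        # boundary case: posterior mean

def should_stop(k_correct, n_labeled, alpha_prior=1.0, beta_prior=1.0,
                max_width=0.1):
    """Illustrative adaptive-stopping check for iterative sampling.

    Stops once an approximate 95% interval on the posterior is narrow
    enough. The paper's actual criterion is conformal-prediction based
    and is not reproduced here.
    """
    a = alpha_prior + k_correct
    b = beta_prior + (n_labeled - k_correct)
    var = a * b / ((a + b) ** 2 * (a + b + 1))  # Beta posterior variance
    return 2 * 1.96 * math.sqrt(var) <= max_width
```

For example, with a transferred Beta(2, 2) prior and 8 correct judgments out of 10 labeled samples, the posterior is Beta(10, 4) and the MAP estimate is its mode, 0.75; `should_stop` then indicates whether more labels are worth collecting before trusting that estimate.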