🤖 AI Summary
Evaluating the performance of large language model (LLM) ensembles, particularly discriminators, under scarce labeled data remains challenging. Method: We propose a cost-effective, accurate, and theoretically grounded maximum a posteriori (MAP) estimation framework. It introduces a Beta-Binomial hierarchical model to characterize judgment uncertainty, a conformal prediction-driven adaptive termination mechanism for iterative sampling, and cross-dataset Bayesian prior transfer to improve few-shot generalization. Results: On the TruthfulQA benchmark, our method achieves a discriminator accuracy estimation error of at most 3.37% using only 10 labeled samples, substantially outperforming existing approaches. The framework combines statistical rigor with engineering practicality, enabling reliable LLM evaluation in low-resource settings.
📝 Abstract
LLM ensembles are widely used as LLM judges. However, how to estimate their accuracy, especially efficiently, remains an open question. In this paper, we present a principled maximum a posteriori (MAP) framework for economical and precise estimation of LLM ensemble judgment performance. We first propose a mixture of Beta-Binomial distributions to model the judgment distribution, refining the vanilla Binomial assumption. Next, we introduce a conformal prediction-driven approach that enables adaptive stopping during iterative sampling, balancing accuracy with efficiency. Furthermore, we design a prior transfer mechanism that uses distributions learned on open-source datasets to improve estimation on a target dataset when annotations are scarce. Finally, we present BetaConform, a framework that integrates our distribution assumption, adaptive stopping, and the prior transfer mechanism to deliver a theoretically guaranteed estimate of the LLM ensemble judgment distribution with a minimum of labeled samples. BetaConform is also validated empirically: with only 10 samples from the TruthfulQA dataset, it gauges the performance of a Llama ensemble judge with an error margin as small as 3.37%.
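The two core ideas in the abstract, a Beta prior updated by labeled judgments and a stopping rule for iterative sampling, can be sketched as below. This is a minimal illustration, not the paper's method: the function names are invented, the conjugate Beta-Binomial update is a simplification of the proposed mixture model, and the interval-width stopping proxy stands in for the actual conformal prediction criterion.

```python
import math

def map_accuracy(k_correct, n_labeled, alpha_prior=1.0, beta_prior=1.0):
    """MAP estimate of judge accuracy under a Beta-Binomial model.

    The Beta(alpha_prior, beta_prior) prior could be learned on an
    open-source dataset and transferred to the target dataset, in the
    spirit of the abstract's prior transfer mechanism.
    """
    a = alpha_prior + k_correct               # posterior alpha
    b = beta_prior + (n_labeled - k_correct)  # posterior beta
    if a > 1 and b > 1:
        return (a - 1) / (a + b - 2)          # posterior mode (MAP)
    return a / (a + b)                        # boundary case: posterior mean

def should_stop(k_correct, n_labeled, alpha_prior=1.0, beta_prior=1.0,
                max_width=0.1):
    """Illustrative adaptive-stopping check for iterative sampling.

    Stops once an approximate 95% interval on the posterior is narrow
    enough. The paper's actual criterion is conformal-prediction based
    and is not reproduced here.
    """
    a = alpha_prior + k_correct
    b = beta_prior + (n_labeled - k_correct)
    var = a * b / ((a + b) ** 2 * (a + b + 1))  # Beta posterior variance
    return 2 * 1.96 * math.sqrt(var) <= max_width
```

For example, with a transferred Beta(2, 2) prior and 8 correct judgments out of 10 labeled samples, the posterior is Beta(10, 4) and the MAP estimate is its mode, 0.75; `should_stop` then indicates whether more labels are worth collecting before trusting that estimate.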