Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses performance saturation in large language model (LLM) ensembles caused by high inter-model correlation, which limits reliability gains under a constrained query budget. The authors formulate ensemble selection as an optimization problem that maximizes the mutual information between the predictions of the selected models and the true labels. To capture error dependencies among models, they use a Gaussian copula to model the correlation structure and propose a greedy mutual-information-based selection algorithm. This approach provides the first information-theoretic characterization of the fundamental error lower bound underlying LLM ensemble saturation and yields a practical strategy for estimating and leveraging mutual information directly from data. Empirical evaluations on the MEDMCQA, MMLU, and IMDB benchmarks demonstrate that the method consistently outperforms strong baselines under identical query budgets.

📝 Abstract
Large language models (LLMs) are often ensembled to improve overall reliability and robustness, but in practice the models are strongly correlated. This raises a fundamental question: which models should be selected when forming an LLM ensemble? We formulate budgeted ensemble selection as maximizing the mutual information between the true label and the predictions of the selected models. Furthermore, to explain why performance can saturate even with many models, we model the correlated errors of the models using a Gaussian copula and show an information-theoretic error floor for the performance of the ensemble. Motivated by these results, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data and iteratively builds an ensemble under a query budget. We test our approach on two question-answering datasets and one binary sentiment classification dataset: MEDMCQA, MMLU, and IMDB movie reviews. Across all datasets, our method consistently outperforms strong baselines under the same query budget.
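The greedy procedure the abstract describes (iteratively adding the model whose predictions most increase the estimated mutual information with the true label, up to a query budget) can be sketched as below. This is a minimal illustration using a plug-in (empirical) mutual-information estimate on a small validation set; the function names, the toy data, and the estimator choice are assumptions for illustration, not the authors' implementation, which additionally models error correlation via a Gaussian copula.

```python
from collections import Counter
from math import log2

def empirical_mi(pred_tuples, labels):
    """Plug-in estimate of I(Y; X) in bits, where X is a tuple of
    model predictions and Y the true label, from paired samples."""
    n = len(labels)
    pxy = Counter(zip(pred_tuples, labels))
    px = Counter(pred_tuples)
    py = Counter(labels)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), all from empirical counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

def greedy_select(model_preds, labels, budget):
    """Greedily build an ensemble: at each step add the model whose
    predictions give the largest empirical MI gain with the labels.

    model_preds: dict name -> list of validation-set predictions.
    """
    selected = []
    while len(selected) < budget:
        base_tuples = [tuple(model_preds[m][i] for m in selected)
                       for i in range(len(labels))]
        base_mi = empirical_mi(base_tuples, labels) if selected else 0.0
        best, best_gain = None, -1.0
        for name in model_preds:
            if name in selected:
                continue
            tuples = [tuple(model_preds[m][i] for m in selected + [name])
                      for i in range(len(labels))]
            gain = empirical_mi(tuples, labels) - base_mi
            if gain > best_gain:
                best, best_gain = name, gain
        selected.append(best)
    return selected

# Toy validation data: model B duplicates A (fully correlated), while C
# errs on a different example, so A + C carries more information than A + B.
labels = [0, 1, 0, 1, 0, 1, 0, 1]
preds = {
    "A": [0, 1, 0, 1, 0, 1, 0, 0],  # wrong on the last example
    "B": [0, 1, 0, 1, 0, 1, 0, 0],  # identical to A: zero marginal gain
    "C": [1, 1, 0, 1, 0, 1, 0, 1],  # wrong on the first example
}
chosen = greedy_select(preds, labels, budget=2)
```

With a budget of two queries, the selector skips the redundant copy `B` and pairs `A` with the decorrelated model `C`, which is exactly the saturation effect the paper attributes to inter-model correlation.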
Problem

Research questions and friction points this paper is trying to address.

LLM ensemble selection
model correlation
query budget
performance saturation
mutual information
Innovation

Methods, ideas, or system contributions that make the work stand out.

mutual information
LLM ensemble selection
Gaussian copula
error correlation
budgeted ensemble