Submodular Benchmark Selection

📅 2026-05-04
🤖 AI Summary
Evaluating large language models across many benchmarks is expensive, and many of those benchmarks are highly correlated, so a small, informative subset can stand in for the full suite. The study formalizes subset selection as submodular maximization under a multivariate Gaussian model, with two objectives: entropy (the log-determinant of the selected covariance submatrix) and mutual information between the selected and the remaining benchmarks. The analysis shows that greedy entropy selection coincides with pivoted Cholesky decomposition and admits spectral residual bounds. Mutual information is non-monotone in general but empirically monotone for small subsets, which motivates optimizing it greedily. Experiments on three matrices derived from ten public leaderboards show that mutual information selection outperforms entropy selection for benchmark imputation at small subset sizes.
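For reference, the two objectives can be written in standard Gaussian form. The notation here is an assumption and is not taken from the paper: V is the full benchmark set, S the selected subset, and Σ the covariance matrix of benchmark scores.

```latex
\begin{align*}
H(X_S) &= \tfrac{1}{2}\log\det\!\left(2\pi e\,\Sigma_{SS}\right), \\
I\!\left(X_S; X_{\bar S}\right) &= H(X_S) + H(X_{\bar S}) - H(X_V)
  = \tfrac{1}{2}\log\frac{\det\Sigma_{SS}\,\det\Sigma_{\bar S\bar S}}{\det\Sigma},
  \qquad \bar S = V\setminus S, \\
H(X_{S\cup\{j\}}) - H(X_S) &= \tfrac{1}{2}\log\!\left(2\pi e\,\sigma^2_{j\mid S}\right),
  \qquad \sigma^2_{j\mid S} = \Sigma_{jj} - \Sigma_{jS}\,\Sigma_{SS}^{-1}\,\Sigma_{Sj}.
\end{align*}
```

The last line gives the marginal entropy gain of adding benchmark j. It depends only on the conditional variance of j given S, which is the diagonal of the residual (Schur complement) matrix. Picking the largest residual diagonal entry at each step is exactly the pivot rule of pivoted Cholesky.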
📝 Abstract
Evaluating large language models across many benchmarks is expensive, yet many benchmarks are highly correlated. We formalize the selection of a small, informative subset as submodular maximization under a multivariate Gaussian model. Entropy (log-determinant covariance) and mutual information between selected and remaining benchmarks arise as natural objectives. Both are submodular; entropy selection coincides with pivoted Cholesky and has spectral residual bounds, while mutual information is non-monotone in general but empirically monotone for small subsets, so we optimize it greedily. Experiments on three matrices from ten public leaderboards show that mutual information selection outperforms entropy for imputation at small subsets.
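The two greedy procedures described above can be sketched in a few lines of NumPy. This is a minimal illustration under the Gaussian assumption, not the authors' implementation: the function names (`greedy_entropy`, `greedy_mi`, `mutual_information`) and the toy covariance matrix are hypothetical, and the mutual information variant recomputes log-determinants from scratch for clarity rather than speed.

```python
import numpy as np


def greedy_entropy(cov, k):
    """Greedy log-det selection: pick the largest conditional variance each step.

    This is the pivot rule of pivoted Cholesky applied to the covariance matrix.
    Assumes k does not exceed the numerical rank of cov.
    """
    resid = np.array(cov, dtype=float)      # residual (Schur complement) covariance
    chosen = np.zeros(resid.shape[0], dtype=bool)
    selected = []
    for _ in range(k):
        diag = np.where(chosen, -np.inf, np.diag(resid))
        j = int(np.argmax(diag))            # largest remaining conditional variance
        selected.append(j)
        chosen[j] = True
        col = resid[:, j] / np.sqrt(resid[j, j])
        resid = resid - np.outer(col, col)  # condition on benchmark j
    return selected


def _logdet(cov, idx):
    """Log-determinant of the principal submatrix indexed by idx (0.0 if empty)."""
    if len(idx) == 0:
        return 0.0
    _, value = np.linalg.slogdet(cov[np.ix_(idx, idx)])
    return value


def mutual_information(cov, subset):
    """I(X_S; X_rest) for a Gaussian; the 2*pi*e constants cancel."""
    n = cov.shape[0]
    rest = [i for i in range(n) if i not in subset]
    everything = list(range(n))
    return 0.5 * (_logdet(cov, list(subset)) + _logdet(cov, rest) - _logdet(cov, everything))


def greedy_mi(cov, k):
    """Greedy selection that maximizes mutual information with the unselected set."""
    n = cov.shape[0]
    selected = []
    for _ in range(k):
        candidates = [j for j in range(n) if j not in selected]
        best = max(candidates, key=lambda j: mutual_information(cov, selected + [j]))
        selected.append(best)
    return selected


# Toy example (made-up numbers): benchmarks 0 and 1 are strongly correlated,
# benchmark 2 has high variance but is independent of the others.
toy_cov = np.array([[1.0, 0.9, 0.0],
                    [0.9, 1.0, 0.0],
                    [0.0, 0.0, 2.0]])
entropy_pick = greedy_entropy(toy_cov, 1)
mi_pick = greedy_mi(toy_cov, 1)
```

On this toy matrix, the entropy rule picks the high-variance but independent benchmark first, while the mutual information rule picks one of the correlated pair, because the independent benchmark carries no information about the others. That contrast illustrates why the two objectives can behave differently when the goal is to impute unselected benchmarks.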
Problem

Research questions and friction points this paper is trying to address.

benchmark selection
submodular optimization
large language models
correlated benchmarks
informative subset
Innovation

Methods, ideas, or system contributions that make the work stand out.

submodular maximization
benchmark selection
mutual information
entropy
large language models