🤖 AI Summary
High evaluation costs, low inclusivity, slow innovation, and substantial environmental impact hinder large language model (LLM) assessment. To address these challenges, this paper proposes an efficient sample selection method grounded in inter-model response disagreement. Unlike conventional subset selection approaches that rely on clustering-based anchor points, our method leverages prediction inconsistency across models as the primary criterion and employs a lightweight, per-sample greedy algorithm to select the top-k most disagreeing instances—eliminating the need for computationally expensive global clustering. Theoretically, the approach achieves information-theoretic optimality. Evaluated on MMLU, HellaSwag, Winogrande, and ARC benchmarks, it substantially outperforms existing compression methods: using only 5%–10% of the full dataset, it attains over 95% accuracy in predicting final model performance. This yields significant gains in evaluation efficiency, robustness, and scalability.
📝 Abstract
Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that $\textit{maximise diversity in model responses}$. Our method, $\textbf{Diversifying Sample Condensation (DISCO)}$, selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. $\textbf{DISCO}$ shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, HellaSwag, Winogrande, and ARC. Code is available here: https://github.com/arubique/disco-public.
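The selection step described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: it scores each sample by the fraction of model pairs whose predictions disagree on it (one plausible per-sample disagreement statistic) and greedily keeps the top-k; the function names `disagreement_scores` and `select_top_k` are hypothetical.

```python
import numpy as np

def disagreement_scores(preds: np.ndarray) -> np.ndarray:
    """Per-sample disagreement score.

    preds: shape (n_models, n_samples), entry [i, j] is model i's
    predicted label on sample j. The score for sample j is the
    fraction of model pairs that disagree on it (0 = all agree).
    """
    n_models, n_samples = preds.shape
    total_pairs = n_models * (n_models - 1) // 2
    scores = np.zeros(n_samples)
    for j in range(n_samples):
        # Count agreeing pairs from the label frequencies in column j.
        _, counts = np.unique(preds[:, j], return_counts=True)
        agree_pairs = sum(c * (c - 1) // 2 for c in counts)
        scores[j] = 1.0 - agree_pairs / total_pairs
    return scores

def select_top_k(preds: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k samples with the highest disagreement."""
    return np.argsort(-disagreement_scores(preds))[:k]

# Toy example: 3 models, 3 samples. All models agree on sample 0,
# so it is never selected; samples 1 and 2 each split the models.
preds = np.array([[0, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
chosen = select_top_k(preds, k=2)
```

This per-sample, greedy scoring is what makes the approach cheap relative to clustering-based anchor selection: each sample is scored independently in one pass, with no global optimisation over the dataset.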