🤖 AI Summary
High evaluation costs, low inclusivity, slow innovation, and substantial environmental impact hinder large language model (LLM) assessment. To address these challenges, this paper proposes an efficient sample selection method grounded in inter-model response disagreement. Unlike conventional subset selection approaches that rely on clustering-based anchor points, our method leverages prediction inconsistency across models as the primary criterion and employs a lightweight, per-sample greedy algorithm to select the top-k most disagreeing instances—eliminating the need for computationally expensive global clustering. Theoretically, the approach achieves information-theoretic optimality. Evaluated on MMLU, HellaSwag, Winogrande, and ARC benchmarks, it substantially outperforms existing compression methods: using only 5%–10% of the full dataset, it attains over 95% accuracy in predicting final model performance. This yields significant gains in evaluation efficiency, robustness, and scalability.
📝 Abstract
Evaluating modern machine learning models has become prohibitively expensive. Benchmarks such as LMMs-Eval and HELM demand thousands of GPU hours per model. Costly evaluation reduces inclusivity, slows the cycle of innovation, and worsens environmental impact. The typical approach follows two steps. First, select an anchor subset of data. Second, train a mapping from the accuracy on this subset to the final test result. The drawback is that anchor selection depends on clustering, which can be complex and sensitive to design choices. We argue that promoting diversity among samples is not essential; what matters is to select samples that $\textit{maximise diversity in model responses}$. Our method, $\textbf{Diversifying Sample Condensation (DISCO)}$, selects the top-k samples with the greatest model disagreements. This uses greedy, sample-wise statistics rather than global clustering. The approach is conceptually simpler. From a theoretical view, inter-model disagreement provides an information-theoretically optimal rule for such greedy selection. $\textbf{DISCO}$ shows empirical gains over prior methods, achieving state-of-the-art results in performance prediction across MMLU, HellaSwag, Winogrande, and ARC. Code is available here: https://github.com/arubique/disco-public.
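The selection step described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: it scores each sample by the fraction of model pairs whose predictions disagree on it (one plausible per-sample disagreement statistic) and greedily keeps the top-k; the function names `disagreement_scores` and `select_top_k` are hypothetical.

```python
import numpy as np

def disagreement_scores(preds: np.ndarray) -> np.ndarray:
    """Per-sample disagreement score.

    preds: shape (n_models, n_samples), entry [i, j] is model i's
    predicted label on sample j. The score for sample j is the
    fraction of model pairs that disagree on it (0 = all agree).
    """
    n_models, n_samples = preds.shape
    total_pairs = n_models * (n_models - 1) // 2
    scores = np.zeros(n_samples)
    for j in range(n_samples):
        # Count agreeing pairs from the label frequencies in column j.
        _, counts = np.unique(preds[:, j], return_counts=True)
        agree_pairs = sum(c * (c - 1) // 2 for c in counts)
        scores[j] = 1.0 - agree_pairs / total_pairs
    return scores

def select_top_k(preds: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k samples with the highest disagreement."""
    return np.argsort(-disagreement_scores(preds))[:k]

# Toy example: 3 models, 3 samples. All models agree on sample 0,
# so it is never selected; samples 1 and 2 each split the models.
preds = np.array([[0, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
chosen = select_top_k(preds, k=2)
```

This per-sample, greedy scoring is what makes the approach cheap relative to clustering-based anchor selection: each sample is scored independently in one pass, with no global optimisation over the dataset.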