ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

📅 2024-12-09
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Traditional static test sets inadequately evaluate foundation models’ diverse capabilities in open-ended scenarios. To address this, we propose ONEBench—a dynamic, extensible benchmarking paradigm that enables on-demand generation of customized evaluation suites targeting open capabilities, framing model assessment as a collective selection and aggregation process over sample-level tests. Our key contributions include: (1) the first unified, open-ended, and evolvable evaluation framework operating at the sample level; (2) a sparse measurement aggregation algorithm, a progressive sample pool construction mechanism, and a cross-modal unified interface (ONEBench-LLM/LMM); and (3) a robustness-aware scoring model with theoretical guarantees on identifiability and fast convergence. Experiments show that ONEBench achieves ranking stability >0.98 under 95% measurement sparsity, reduces evaluation cost by 20×, and attains >0.98 correlation with mean-score rankings on homogeneous data—enabling unified, efficient, and reliable assessment of both language and multimodal models.

📝 Abstract
Traditional fixed test sets fall short in evaluating the open-ended capabilities of foundation models. To address this, we propose ONEBench (OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias. Most importantly, it frames model evaluation as a collective process of selecting and aggregating sample-level tests. The shift from task-specific benchmarks to ONEBench introduces two challenges: (1) heterogeneity and (2) incompleteness. Heterogeneity refers to aggregation over diverse metrics, while incompleteness describes comparing models evaluated on different data subsets. To address these challenges, we explore algorithms to aggregate sparse measurements into reliable model scores. Our aggregation algorithm ensures identifiability (asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model ranking with less data. On homogeneous datasets, we show our aggregation algorithm provides rankings that highly correlate with those produced by average scores. We also demonstrate robustness to ~95% of measurements missing, reducing evaluation cost by up to 20× with little-to-no change in model rankings. We introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language models, unifying evaluations across these domains. Overall, we present a technique for open-ended evaluation, which can aggregate over incomplete, heterogeneous sample-level measurements to continually grow a benchmark alongside the rapidly developing foundation models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating open-ended capabilities of foundation models
Aggregating diverse metrics into reliable model scores
Reducing evaluation cost while maintaining accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified sample pool for custom benchmarks
Algorithms for sparse measurement aggregation
Robust ranking with missing data
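The core aggregation idea — turning a sparse model-by-sample matrix of outcomes into a single reliable ranking — can be illustrated with a Bradley-Terry-style estimator. This is a minimal sketch under assumptions, not the paper's actual algorithm: the function name and the conversion of sample-level results into pairwise wins are hypothetical choices for illustration only.

```python
import numpy as np

def aggregate_sparse_measurements(scores, n_iter=200):
    """Bradley-Terry-style aggregation over a sparse model-by-sample
    matrix of outcomes (np.nan marks missing measurements).

    Returns one strength per model; sorting by strength gives the
    aggregated ranking. Illustrative sketch, not the paper's method.
    """
    n_models, n_samples = scores.shape
    strength = np.ones(n_models)

    # Convert sample-level results into pairwise win counts: on each
    # sample, a model with the higher score "beats" a lower-scoring
    # model; pairs with a missing measurement are simply skipped,
    # which is how incompleteness is tolerated.
    wins = np.zeros((n_models, n_models))
    for s in range(n_samples):
        col = scores[:, s]
        observed = np.flatnonzero(~np.isnan(col))
        for i in observed:
            for j in observed:
                if col[i] > col[j]:
                    wins[i, j] += 1

    # Hunter's MM updates for Bradley-Terry strengths:
    # p_i <- W_i / sum_j n_ij / (p_i + p_j)
    for _ in range(n_iter):
        total = strength[:, None] + strength[None, :]
        np.fill_diagonal(total, np.inf)      # no self-comparisons
        comparisons = wins + wins.T          # n_ij: times i and j met
        denom = (comparisons / total).sum(axis=1)
        num = wins.sum(axis=1)               # W_i: total wins of i
        strength = np.where(denom > 0,
                            num / np.maximum(denom, 1e-12), strength)
        strength /= strength.sum()           # fix scale (identifiability)
    return strength
```

Because wins are only counted where both models were measured on the same sample, the estimator degrades gracefully as the measurement matrix becomes sparser, which is the behavior the benchmark's robustness experiments probe.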