How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

📅 2026-05-28
🤖 AI Summary
This work addresses how to quantify the value of a dataset and how to optimize data selection efficiently to improve model performance. The authors show that common neural-scaling-law objectives and the Vendi Score are both submodular, and propose matrix spectral functions as a general-purpose family of data-appraisal objectives; when built from weakly matrix-monotone functions, these objectives are weakly submodular. Building on this theory, they develop secular-equation-based eigenvalue updates that avoid repeated eigendecompositions during greedy optimization over embeddings, giving an average empirical speedup of about 35,000× and making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Experimentally, facility location performs best across multiple datasets, and data value turns out not to be determined by dataset size, class balance, and training budget alone: even with these factors controlled, performance ranges smoothly from good to bad.
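To make the facility location objective and the greedy strategy mentioned above concrete, here is a minimal sketch in Python. It uses the standard textbook form f(S) = Σ_i max_{j∈S} sim(i, j) on a nonnegative similarity matrix. The function names, the cosine-similarity kernel, and the synthetic embeddings are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def facility_location(sim, subset):
    """f(S) = sum_i max_{j in S} sim[i, j], with f(empty set) = 0.

    Assumes sim is a nonnegative (n x n) similarity matrix.
    """
    if len(subset) == 0:
        return 0.0
    return float(sim[:, list(subset)].max(axis=1).sum())

def greedy_select(sim, k):
    """Pick k indices greedily, each time adding the largest marginal gain."""
    n = sim.shape[0]
    selected = []
    # best_cover[i] = highest similarity of point i to anything selected so far
    best_cover = np.zeros(n)
    for _ in range(k):
        # Marginal gain of candidate j: sum_i max(sim[i, j] - best_cover[i], 0)
        gains = np.maximum(sim - best_cover[:, None], 0.0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf  # never pick the same index twice
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# Synthetic unit-norm embeddings (illustrative data only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)
sim = np.clip(X @ X.T, 0.0, None)  # clipped cosine similarity, nonnegative

chosen = greedy_select(sim, 5)
print(chosen, facility_location(sim, chosen))
```

Because facility location is monotone submodular, this plain greedy loop carries the classical (1 − 1/e) approximation guarantee under a cardinality constraint; the sketch recomputes all gains each round for clarity rather than speed.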
📝 Abstract
Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This class also includes determinantal (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that fixed-size subsets drawn uniformly at random, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.
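The following sketch illustrates the two ingredients the abstract combines: the Vendi Score as the exponential of the entropy of a normalized eigenvalue spectrum, and a rank-one eigenvalue update obtained from the secular equation instead of a fresh eigendecomposition. It is an illustration under stated assumptions, not the authors' algorithm: it assumes unit-norm embeddings with a linear kernel (so adding a point x changes the m×m matrix XᵀX by the rank-one term xxᵀ), it uses plain bisection as the root-finder, and the names `vendi_score` and `secular_eigs` are hypothetical.

```python
import numpy as np

def vendi_score(eigs):
    """exp(Shannon entropy) of the eigenvalues after normalising them to sum to 1."""
    p = np.asarray(eigs, dtype=float)
    p = p[p > 1e-12]          # treat 0 * log(0) as 0
    p = p / p.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

def secular_eigs(d, z, rho=1.0, iters=100):
    """Eigenvalues of diag(d) + rho * z z^T from the secular equation

        f(lam) = 1 + rho * sum_i z_i^2 / (d_i - lam) = 0.

    Assumes rho > 0, d strictly increasing, and every z_i nonzero
    (no deflation is handled in this sketch). Under these assumptions
    there is exactly one root in each interval (d_i, d_{i+1}) and one in
    (d_n, d_n + rho * ||z||^2), and f is increasing on each interval.
    """
    d = np.asarray(d, dtype=float)
    z = np.asarray(z, dtype=float)
    f = lambda lam: 1.0 + rho * np.sum(z**2 / (d - lam))
    upper = np.append(d[1:], d[-1] + rho * (z @ z))
    roots = np.empty_like(d)
    for i, (lo, hi) in enumerate(zip(d, upper)):
        for _ in range(iters):          # bisection: f < 0 left of the root
            mid = 0.5 * (lo + hi)
            if f(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        roots[i] = 0.5 * (lo + hi)
    return roots

# Synthetic unit-norm embeddings (illustrative data only)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
X /= np.linalg.norm(X, axis=1, keepdims=True)
x_new = rng.normal(size=4)
x_new /= np.linalg.norm(x_new)

d, Q = np.linalg.eigh(X.T @ X)   # one-off eigendecomposition of the m x m matrix
z = Q.T @ x_new                  # candidate point expressed in the eigenbasis
updated = secular_eigs(d, z)     # spectrum of X^T X + x_new x_new^T
gain = vendi_score(updated) - vendi_score(d)   # marginal gain of adding x_new
print(gain)
```

Each candidate evaluated this way costs a matrix-vector product plus m one-dimensional root-finding problems, each of which evaluates an O(m) sum, rather than a full O(m³) eigendecomposition, which is the kind of per-candidate saving the abstract describes. A production implementation would need deflation for repeated eigenvalues or zero components of z and a faster root-finder than bisection.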
Problem

Research questions and friction points this paper is trying to address.

data valuation
neural scaling laws
Vendi Score
submodular functions
dataset selection
Innovation

Methods, ideas, or system contributions that make the work stand out.

matrix spectral functions
submodularity
Vendi Score
secular equation optimization
data valuation