🤖 AI Summary
To address budget-constrained sample selection for multilayer perceptrons (MLPs) trained on massive, multi-source, and heterogeneous data, this paper proposes the Data Value Contribution (DVC) method. DVC introduces the first hierarchical data valuation framework that jointly models Layer Value Contribution (LVC), which quantifies the utility of hidden-layer intermediate representations, and Global Value Contribution (GVC), which measures global semantic consistency, thereby integrating data quality, relevance, and distributional diversity. It further incorporates an Upper Confidence Bound (UCB) algorithm for adaptive data source selection. By modeling the dynamic evolution of network parameters during training and employing six efficient metrics at different granularities, DVC remains tractable for larger MLPs. Extensive experiments across six benchmark datasets and eight baselines demonstrate consistent improvements in accuracy and F1 score under varying budget constraints, outperforming state-of-the-art methods. The approach offers both theoretical guarantees and practical scalability.
📝 Abstract
Data selection is one of the fundamental problems in neural network training, particularly for multi-layer perceptrons (MLPs), where identifying the most valuable training samples from massive, multi-source, and heterogeneous data sources under budget constraints poses significant challenges. Existing data selection methods, including coreset construction, data Shapley values, and influence functions, suffer from critical limitations: they oversimplify nonlinear transformations, ignore informative intermediate representations in hidden layers, or fail to scale to larger MLPs due to high computational complexity. In response, we propose DVC (Data Value Contribution), a novel budget-aware method for evaluating and selecting data for MLP training that accounts for the dynamic evolution of network parameters during training. The DVC method decomposes data contribution into Layer Value Contribution (LVC) and Global Value Contribution (GVC), employing six carefully designed metrics and corresponding efficient algorithms to capture data characteristics across three dimensions (quality, relevance, and distributional diversity) at different granularities. DVC integrates these assessments with an Upper Confidence Bound (UCB) algorithm for adaptive source selection that balances exploration and exploitation. Extensive experiments across six datasets and eight baselines demonstrate that our method consistently outperforms existing approaches under various budget constraints, achieving superior accuracy and F1 scores. Our approach represents the first systematic treatment of hierarchical data evaluation for neural networks, providing both theoretical guarantees and practical advantages for large-scale machine learning systems.
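To make the exploration/exploitation idea concrete, the sketch below shows a standard UCB1 loop applied to data source selection. This is an illustrative toy, not the paper's actual algorithm: the reward here is a fixed per-source utility (`source_utilities` is a made-up stand-in for the DVC scores), whereas the paper derives rewards from its LVC/GVC metrics.

```python
import math

def ucb_score(mean, count, t, c=2.0):
    # UCB1: empirical mean reward plus an exploration bonus that
    # shrinks as a source is sampled more often.
    return mean + math.sqrt(c * math.log(t) / count)

def allocate_budget(source_utilities, budget):
    """Toy UCB loop: each round, query the source with the highest UCB
    score and observe its (here, fixed) utility as the reward."""
    k = len(source_utilities)
    counts = [0] * k   # times each source was queried
    means = [0.0] * k  # running mean reward per source
    for t in range(1, budget + 1):
        if t <= k:
            i = t - 1  # sample each source once before trusting the scores
        else:
            i = max(range(k), key=lambda j: ucb_score(means[j], counts[j], t))
        reward = source_utilities[i]
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update
    return counts

counts = allocate_budget([0.2, 0.5, 0.8], budget=300)
# most of the budget is spent on the highest-utility source,
# while lower-utility sources still receive some exploratory queries
```

Because the bonus term decays with each query, poor sources are not abandoned outright; they keep receiving occasional probes, which is what makes the allocation robust when source utilities drift as the network's parameters evolve.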