🤖 AI Summary
In high-label-cost scenarios (e.g., medical imaging), efficiently selecting the most informative subset from large-scale unlabeled data for annotation remains challenging.
Method: We establish a theoretical equivalence between subset selection and neural network pruning in terms of their optimization objectives, enabling pruning heuristics to be transferred to data selection. Specifically, we propose a subset scoring criterion based on neural feature norms, which requires no additional training or gradient computation, and demonstrate its compatibility across architectures (ResNet, ViT) and datasets (CIFAR, ImageNet subsets).
Contribution/Results: Our method achieves state-of-the-art performance under few-shot labeling settings, outperforming existing approaches by 1.8–3.2 percentage points of accuracy on average. This empirically validates the cross-domain applicability of model compression techniques to data-centric machine learning and highlights their value for data engineering.
📝 Abstract
The effectiveness of deep neural networks depends heavily on large amounts of annotated data. However, annotation can be very expensive in some domains, such as medical data. It is therefore important to select the data to be annotated wisely, which is known as the subset selection problem. We investigate the relationship between subset selection and neural network pruning, which is more widely studied, and establish a correspondence between them. Leveraging insights from network pruning, we propose using the norm criterion of neural network features to improve subset selection methods. We empirically validate the proposed strategy on various networks and datasets, demonstrating improved accuracy. This shows the potential of employing pruning tools for subset selection.
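To make the norm criterion concrete, here is a minimal sketch of selection by feature norm. The exact scoring rule of the paper is not reproduced here; the function name, the use of the L2 norm, and the assumption that features come pre-extracted from a frozen backbone (so no training or gradients are needed) are illustrative assumptions, not the authors' verbatim method.

```python
import numpy as np

def select_subset_by_feature_norm(features: np.ndarray, budget: int) -> np.ndarray:
    """Rank unlabeled samples by the L2 norm of their features and return
    the indices of the `budget` highest-norm samples for annotation.

    `features` has shape (n_samples, feature_dim); in practice these would
    be penultimate-layer activations of a frozen pretrained backbone
    (e.g. ResNet or ViT), so no extra training or gradient computation
    is required.
    """
    norms = np.linalg.norm(features, axis=1)   # one scalar score per sample
    return np.argsort(norms)[::-1][:budget]    # indices of top-`budget` norms

# Toy usage: 5 "samples" with 3-dimensional features, labeling budget of 2.
feats = np.array([
    [0.1, 0.1, 0.1],
    [3.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 0.0, 0.5],
])
selected = select_subset_by_feature_norm(feats, budget=2)
```

The attraction of such a criterion, as in magnitude-based pruning, is that it is a single forward pass plus a sort: it scales linearly in the size of the unlabeled pool.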