Exploring Learning Complexity for Efficient Downstream Dataset Pruning

📅 2024-02-08
📈 Citations: 1
Influential: 0

🤖 AI Summary
Fine-tuning datasets for large pre-trained models are highly redundant, yet existing pruning methods require training on the entire dataset, which is impractical at this scale. This paper proposes Distorting-based Learning Complexity (DLC), a training-free hardness score. It contributes (1) a training-free estimate of learning complexity that uses lightweight weight masking instead of SGD optimization, and (2) FlexRand, a flexible under-sampling strategy with randomness that replaces top-K selection to alleviate distribution shift in the pruned subset. On the image pruning benchmark, DLC reduces pruning time by 35× and, combined with FlexRand, achieves state-of-the-art performance. The method is also evaluated on instruction dataset pruning, so the same training-free score covers both image and instruction data.

📝 Abstract
The ever-increasing fine-tuning cost of large-scale pre-trained models makes dataset pruning, which aims to reduce dataset size while maintaining task performance, increasingly important. However, existing dataset pruning methods require training on the entire dataset, which is impractical for large-scale pre-trained models. In this paper, we propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC) to efficiently identify informative images and instructions in the downstream dataset. Our method is motivated by the observation that easy samples, which are learned faster, can also be learned with fewer parameters. Specifically, we define the Learning Complexity to quantify sample hardness and use a lightweight weight-masking process for fast estimation instead of costly SGD optimization. Based on DLC, we further design a flexible under-sampling strategy with randomness (dubbed FlexRand), which replaces the top-K strategy to alleviate the severe distribution shift of the selected subset. Extensive experiments on downstream image and instruction dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. On the image pruning benchmark, DLC reduces pruning time by 35x while establishing state-of-the-art performance with FlexRand.
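The abstract does not spell out how Learning Complexity is computed, so the following is only a toy illustration of the general idea of scoring hardness by weight masking. It assumes a fixed linear softmax classifier and averages each sample's cross-entropy loss over random masks at several keep ratios. The function names, the keep ratios, and the use of average loss as the score are all assumptions made for this sketch, not the paper's definition.

```python
import math
import random


def masked_loss(weights, x, label, keep_ratio, rng):
    """Cross-entropy of a linear softmax classifier after randomly zeroing weights."""
    logits = []
    for row in weights:  # one weight row per class
        # Each weight is kept with probability keep_ratio and dropped otherwise.
        logits.append(sum(w * xi for w, xi in zip(row, x) if rng.random() < keep_ratio))
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def masking_hardness(weights, samples, keep_ratios=(0.9, 0.7, 0.5, 0.3), trials=20, seed=0):
    """Average masked loss per sample; higher means harder under this toy proxy."""
    rng = random.Random(seed)
    scores = []
    for x, label in samples:
        total = 0.0
        for ratio in keep_ratios:
            for _ in range(trials):
                total += masked_loss(weights, x, label, ratio, rng)
        scores.append(total / (len(keep_ratios) * trials))
    return scores


# Hypothetical two-class model and two samples of class 0:
# the first lies far from the decision boundary, the second close to it.
weights = [[2.0, 0.0], [0.0, 2.0]]
samples = [([3.0, 0.0], 0), ([1.0, 0.9], 0)]
scores = masking_hardness(weights, samples)
```

In this toy setting the sample near the decision boundary receives the higher score, which matches the intuition in the abstract that easy samples survive with fewer parameters. No gradient step is taken at any point; only forward passes under masked weights are needed.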
Problem

Research questions and friction points this paper is trying to address.

Reducing dataset size while maintaining task performance
Avoiding costly training for large-scale pre-trained models
Identifying informative samples efficiently without full training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free hardness score DLC for dataset pruning
Lightweight weight masking for fast complexity estimation
FlexRand under-sampling to reduce subset distribution shift
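The source says only that FlexRand replaces top-K selection with a flexible, randomized under-sampling step. As a rough sketch of why randomness can soften distribution shift, the code below contrasts plain top-K with uniform sampling from a wider window of the hardness ranking. The window rule and the `start` and `width` parameters are hypothetical choices for illustration, not FlexRand's actual procedure.

```python
import random


def top_k(scores, k):
    """Baseline: indices of the k highest-scoring (hardest) samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def windowed_random_subset(scores, k, start=0.0, width=0.5, seed=0):
    """Draw k indices uniformly from a contiguous window of the hardness ranking.

    start and width are fractions of the dataset (hypothetical knobs);
    start=0.0 places the window at the hardest samples.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    lo = int(start * n)
    hi = min(n, max(lo + k, lo + int(width * n)))
    window = order[lo:hi]
    if len(window) < k:
        raise ValueError("window holds fewer than k samples")
    return random.Random(seed).sample(window, k)


# Hypothetical hardness scores for eight samples.
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4]
hardest = top_k(scores, 3)
subset = windowed_random_subset(scores, 3, start=0.0, width=0.75)
```

Top-K always returns the same extreme samples, whereas the windowed variant spreads the selection over a broader range of difficulty, so the subset is less concentrated at one end of the score distribution.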
Authors

Wenyu Jiang
Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China; Department of Computer Science and Technology, Nanjing University, Nanjing, China
Zhenlong Liu
Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China
Zejian Xie
Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China
Songxin Zhang
Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China
Bingyi Jing
Chair Professor, Southern University of Science & Technology
Statistics, Data Science, AI
Hongxin Wei
Southern University of Science and Technology (SUSTech)
Reliable Machine Learning, Uncertainty Estimation, Statistics