Exploring Learning Complexity for Efficient Downstream Dataset Pruning

📅 2024-02-08

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Datasets used to fine-tune large models are highly redundant, yet existing pruning methods require training on the full dataset, which is impractical at that scale. This paper proposes Distorting-based Learning Complexity (DLC), a training-free hardness score. The method contributes (1) a training-free estimate of learning complexity obtained by masking weights rather than running SGD, and (2) FlexRand, a randomized under-sampling strategy that replaces top-K selection to mitigate distribution shift in the pruned subset. On the image pruning benchmark, DLC cuts pruning time by 35× and, combined with FlexRand, achieves state-of-the-art performance. The approach is also evaluated on instruction dataset pruning. Because subset selection needs no iterative training, DLC is practical for scalable, data-efficient model adaptation.

📝 Abstract

The ever-increasing cost of fine-tuning large-scale pre-trained models makes dataset pruning increasingly important; pruning aims to reduce dataset size while maintaining task performance. However, existing dataset pruning methods require training on the entire dataset, which is impractical for large-scale pre-trained models. In this paper, we propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC) to efficiently identify informative images and instructions in the downstream dataset. Our method is motivated by the observation that easy samples, which are learned faster, can also be learned with fewer parameters. Specifically, we define the Learning Complexity to quantify sample hardness and use a lightweight weight-masking process for fast estimation instead of costly SGD optimization. Based on DLC, we further design a flexible under-sampling strategy with randomness (dubbed FlexRand) that replaces the top-K strategy to alleviate severe subset distribution shift. Extensive experiments on downstream image and instruction dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. On the image pruning benchmark, DLC reduces pruning time by 35x while establishing state-of-the-art performance with FlexRand.
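The abstract does not give the exact definition of the Learning Complexity, so the following is only a minimal sketch of the general idea: score a sample by how well a pre-trained model still predicts it as more and more weights are masked out, with no gradient steps. Everything specific here is an assumption made for illustration: the toy linear classifier, magnitude-based masking, the keep ratios, and the use of averaged cross-entropy as the score. The function names `magnitude_mask` and `complexity_score` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def magnitude_mask(weights, keep_ratio):
    """Keep roughly the largest-magnitude `keep_ratio` fraction of weights.

    Assumption: magnitude-based masking. Ties at the threshold are all kept,
    so slightly more than `keep_ratio` of the weights may survive.
    """
    flat = sorted((abs(w) for row in weights for w in row), reverse=True)
    n_keep = max(1, math.ceil(keep_ratio * len(flat)))
    threshold = flat[n_keep - 1]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def complexity_score(x, label, weights, keep_ratios=(0.25, 0.5, 0.75, 1.0)):
    """Average cross-entropy of one sample over several masked model copies.

    Assumption: a higher average loss means the sample needs more parameters
    to be predicted correctly, i.e. it is "harder". No training is involved.
    """
    losses = []
    for ratio in keep_ratios:
        masked = magnitude_mask(weights, ratio)
        # Toy linear classifier: one weight row per class.
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in masked]
        prob = softmax(logits)[label]
        losses.append(-math.log(max(prob, 1e-12)))
    return sum(losses) / len(losses)


# Toy usage: two classes, three features (all values are made up).
toy_weights = [[2.0, 0.1, 0.0], [0.0, 0.1, 2.0]]
easy_score = complexity_score([1.0, 0.0, 0.0], 0, toy_weights)
hard_score = complexity_score([0.1, 1.0, 0.0], 0, toy_weights)
print(easy_score, hard_score)
```

In this toy setup the sample that relies on the dominant weight keeps a low loss under every mask, while the sample whose evidence sits in a small weight scores higher. Only forward passes are needed, which is the property the abstract contrasts with SGD-based scoring.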

Problem

Research questions and friction points this paper is trying to address.

Reducing dataset size while maintaining task performance

Avoiding costly training for large-scale pre-trained models

Identifying informative samples efficiently without full training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free hardness score DLC for dataset pruning

Lightweight weight masking for fast complexity estimation

FlexRand under-sampling to reduce subset distribution shift
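To illustrate why replacing top-K selection with randomized under-sampling can reduce subset distribution shift, here is a small sketch. It is not the paper's FlexRand procedure, whose details are not given on this page; it substitutes plain score-stratified random sampling as a stand-in. The function names, the number of bins, and the equal per-bin quota are all assumptions.

```python
import random


def topk_prune(scores, budget):
    """Baseline: keep the `budget` samples with the highest hardness scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:budget]


def stratified_random_prune(scores, budget, n_bins=4, seed=0):
    """Stand-in for randomized under-sampling (NOT the paper's FlexRand).

    Samples are sorted by score, split into `n_bins` contiguous bins, and an
    equal share of the budget is drawn at random from each bin, so the kept
    subset covers the whole score range rather than only the hardest tail.
    """
    if budget > len(scores):
        raise ValueError("budget cannot exceed the number of samples")
    rng = random.Random(seed)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    bins = [order[b * n // n_bins:(b + 1) * n // n_bins] for b in range(n_bins)]

    selected = []
    for b, members in enumerate(bins):
        quota = budget // n_bins + (1 if b < budget % n_bins else 0)
        selected.extend(rng.sample(members, min(quota, len(members))))

    # Top up at random if some bin was smaller than its quota.
    if len(selected) < budget:
        chosen = set(selected)
        leftovers = [i for i in order if i not in chosen]
        selected.extend(rng.sample(leftovers, budget - len(selected)))
    return selected


# Toy usage: 100 samples with evenly spread scores, keep 20 of them.
toy_scores = [i / 100 for i in range(100)]
kept_topk = topk_prune(toy_scores, 20)
kept_random = stratified_random_prune(toy_scores, 20)
print(min(toy_scores[i] for i in kept_topk))
print(min(toy_scores[i] for i in kept_random))
```

With top-K, every kept sample comes from the hardest fifth of the data, so the subset's score distribution differs sharply from the full dataset's. The stratified variant keeps samples from every score range, which is the kind of shift reduction the randomized strategy is described as targeting.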
