Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding

📅 2025-01-05

🤖 AI Summary

Efficient data pruning for cross-dataset fine-tuning is complicated by differences in dataset scale, distribution shift, and inconsistent label spaces. This paper proposes a lightweight cross-dataset pruning framework that requires neither a reference model nor full-dataset training, and presents it as the first of its kind. The method combines TF-IDF-based text embeddings with geometric-median-based importance estimation, then applies dataset-size-adaptive pruning: small datasets keep the samples farthest from the geometric median, while large datasets use distance-based stratified sampling. Experiments on six heterogeneous natural language understanding (NLU) datasets report an average accuracy drop of less than 0.8% and a 2.3× training speedup, with performance preserved or in some cases improved and computational overhead substantially reduced.

📝 Abstract

Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on within-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance, and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with the geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain samples far from the geometric median, while for larger ones, we employ distance-based stratified pruning. Experimental results on six diverse datasets, spanning various tasks and scales, demonstrate the effectiveness of our method while significantly reducing computational resources. Source code is available at: https://github.com/he-y/NLP-Dataset-Pruning
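
The pipeline described in the abstract (TF-IDF embeddings, geometric median, size-adaptive pruning) can be illustrated with a minimal sketch. This is not the authors' implementation; their code lives in the repository linked above. Everything below is an assumption made for illustration: the function names, the whitespace tokenizer, the smoothed IDF formula, Weiszfeld's algorithm for the geometric median, the `small_threshold` cutoff, the equal-width distance strata, and the budget allocation across strata.

```python
import math
import random
from collections import Counter


def tfidf_vectors(texts):
    """Dense, L2-normalised TF-IDF vectors (whitespace tokens, smoothed IDF)."""
    docs = [t.lower().split() for t in texts]
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    df = Counter(w for d in docs for w in set(d))  # document frequency
    n = len(docs)
    vectors = []
    for d in docs:
        vec = [0.0] * len(vocab)
        for w, tf in Counter(d).items():
            vec[vocab[w]] = tf * (math.log((1 + n) / (1 + df[w])) + 1.0)
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def geometric_median(points, iters=200, eps=1e-9):
    """Weiszfeld's algorithm, started from the coordinate-wise mean."""
    dim = len(points[0])
    median = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        # Inverse-distance weights; eps guards against division by zero.
        weights = [1.0 / max(distance(p, median), eps) for p in points]
        total = sum(weights)
        new = [sum(w * p[i] for w, p in zip(weights, points)) / total
               for i in range(dim)]
        if distance(new, median) < eps:
            return new
        median = new
    return median


def prune(texts, keep_ratio, small_threshold=1000, n_strata=5, seed=0):
    """Return sorted indices of the samples to keep.

    small_threshold, n_strata and seed are hypothetical knobs for this
    sketch, not values taken from the paper.
    """
    vectors = tfidf_vectors(texts)
    median = geometric_median(vectors)
    dists = [distance(v, median) for v in vectors]
    budget = max(1, round(keep_ratio * len(texts)))

    if len(texts) < small_threshold:
        # Small dataset: keep the samples farthest from the geometric median.
        order = sorted(range(len(texts)), key=lambda i: dists[i], reverse=True)
        return sorted(order[:budget])

    # Large dataset: equal-width distance strata, sampled as evenly as possible.
    lo, hi = min(dists), max(dists)
    width = (hi - lo) / n_strata or 1.0
    strata = [[] for _ in range(n_strata)]
    for i, d in enumerate(dists):
        strata[min(n_strata - 1, int((d - lo) / width))].append(i)
    strata.sort(key=len)  # smallest first, so unused budget rolls forward
    rng = random.Random(seed)
    kept, remaining = [], budget
    for k, stratum in enumerate(strata):
        share = min(len(stratum), remaining // (n_strata - k))
        kept.extend(rng.sample(stratum, share))
        remaining -= share
    return sorted(kept)
```

A call such as `prune(train_texts, keep_ratio=0.3)` would return the indices of roughly 30% of the samples, where `train_texts` is a hypothetical list of strings. Label information is deliberately ignored here; how the paper handles class imbalance and differing label spaces is not specified in the abstract, so this sketch does not attempt it.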

Problem

Research questions and friction points this paper is trying to address.

Efficient Training

Natural Language Understanding

Resource Optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Swift Cross-Dataset Pruning

TF-IDF

Hierarchical Pruning
