Disentangling the Roles of Representation and Selection in Data Pruning

📅 2025-07-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing data pruning methods combine many design choices whose individual effects have not been systematically studied, which hinders progress in the field. Method: The paper decomposes data pruning into two components, the data representation and the selection algorithm, and systematically evaluates how each one influences instance selection for NLP model training. The analysis combines theory with empirical evaluation covering gradient- and embedding-based representations and several selection algorithms. Contribution/Results: Representation quality matters more than the choice of selection algorithm: better representations (e.g., training gradients) generally lead to better selections regardless of the algorithm, whereas no selection algorithm consistently outperforms the others across settings. Even algorithms designed for the same objective can select drastically different subsets. The work identifies representation quality, rather than algorithmic refinement, as the primary lever for improving data pruning.

📝 Abstract
Data pruning, selecting small but impactful subsets, offers a promising way to efficiently scale NLP model training. However, existing methods often involve many different design choices, which have not been systematically studied. This limits future developments. In this work, we decompose data pruning into two key components: the data representation and the selection algorithm, and we systematically analyze their influence on the selection of instances. Our theoretical and empirical results highlight the crucial role of representations: better representations, e.g., training gradients, generally lead to a better selection of instances, regardless of the chosen selection algorithm. Furthermore, different selection algorithms excel in different settings, and none consistently outperforms the others. Moreover, the selection algorithms do not always align with their intended objectives: for example, algorithms designed for the same objective can select drastically different instances, highlighting the need for careful evaluation.
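The decomposition described in the abstract can be illustrated with a minimal sketch in which the representation and the selection algorithm are independent, swappable functions. This is not the authors' implementation: the toy logistic-regression gradient, the function names (`embedding_representation`, `gradient_representation`, `k_center_greedy`, `top_norm`, `prune`), and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def embedding_representation(data):
    """Hypothetical embedding-style representation: the raw feature vectors."""
    X, _ = data
    return X


def gradient_representation(data, w):
    """Hypothetical gradient representation: per-example gradient of the
    logistic loss with respect to the weights w (a toy stand-in model)."""
    X, y = data
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
    return (p - y)[:, None] * X          # one gradient row per example


def k_center_greedy(reps, k, seed=0):
    """Diversity-oriented selection: repeatedly add the instance that is
    farthest from everything selected so far."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(reps.shape[0]))]
    dist = np.linalg.norm(reps - reps[selected[0]], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(dist))
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(reps - reps[nxt], axis=1))
    return selected


def top_norm(reps, k):
    """Score-based selection: keep the k instances with the largest norm."""
    norms = np.linalg.norm(reps, axis=1)
    return [int(i) for i in np.argsort(-norms)[:k]]


def prune(data, represent, select, k):
    """Data pruning as two decoupled steps: represent, then select."""
    return select(represent(data), k)


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(float)
w = rng.normal(size=8)

# Same selection algorithm, two different representations.
subset_emb = prune((X, y), embedding_representation, k_center_greedy, k=20)
subset_grad = prune((X, y), lambda d: gradient_representation(d, w), k_center_greedy, k=20)

# Same representation, a different selection algorithm.
subset_norm = prune((X, y), embedding_representation, top_norm, k=20)
```

Because `prune` only composes the two callables, either component can be varied while the other is held fixed, which is the kind of controlled comparison the abstract describes.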
Problem

Research questions and friction points this paper is trying to address.

Analyzing data representation and selection in pruning
Evaluating impact of representations on instance selection
Comparing performance of different selection algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decompose data pruning into representation and selection
Analyze influence of representations on instance selection
Evaluate selection algorithms across different settings
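One simple way to check whether two selection algorithms pick similar instances is to measure the overlap between the index sets they return. The sketch below uses Jaccard overlap; the choice of metric and the name `jaccard_overlap` are assumptions for illustration and are not taken from the paper.

```python
def jaccard_overlap(subset_a, subset_b):
    """Jaccard overlap between two selected index sets (1.0 = identical)."""
    a, b = set(subset_a), set(subset_b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


# Illustrative index lists standing in for the output of two algorithms.
overlap = jaccard_overlap([0, 1, 2, 3], [2, 3, 4, 5])
```

A low overlap between algorithms that target the same objective would correspond to the divergence in selected instances that the abstract reports.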
Yupei Du
Utrecht University, The Netherlands
Yingjin Song
Utrecht University, The Netherlands
Hugh Mee Wong
Utrecht University, The Netherlands
Daniil Ignatev
Utrecht University, The Netherlands
Albert Gatt
Professor of Natural Language Generation, Utrecht University
Computational Linguistics, Natural Language Generation, Vision and Language, Language Production
Dong Nguyen
Utrecht University, The Netherlands