Most Influential Subset Selection: Challenges, Promises, and Beyond

📅 2024-09-25
🏛️ arXiv.org
📈 Citations: 1
Influential: 1
📄 PDF
🤖 AI Summary
This paper addresses the problem of quantifying the *collective influence* of subsets of training data on model behavior, noting that conventional influence functions, which rely on additivity assumptions, fail to capture higher-order interactions among samples in nonlinear models. The authors formalize the *Most Influential Subset Selection (MISS)* problem and prove that influence-based greedy heuristics can fail even in linear regression. They then analyze an *adaptive* version of these heuristics that applies them iteratively, explicitly capturing interactions among samples and thereby partially overcoming the additivity limitation. Experiments on real-world datasets corroborate the theoretical findings, show that the benefit of adaptivity extends to classification tasks and nonlinear neural networks, and highlight an inherent trade-off between identification accuracy and computational efficiency in subset selection.

📝 Abstract
How can we attribute the behaviors of machine learning models to their training data? While the classic influence function sheds light on the impact of individual samples, it often fails to capture the more complex and pronounced collective influence of a set of samples. To tackle this challenge, we study the Most Influential Subset Selection (MISS) problem, which aims to identify a subset of training samples with the greatest collective influence. We conduct a comprehensive analysis of the prevailing approaches in MISS, elucidating their strengths and weaknesses. Our findings reveal that influence-based greedy heuristics, a dominant class of algorithms in MISS, can provably fail even in linear regression. We delineate the failure modes, including the errors of the influence function and the non-additive structure of the collective influence. Conversely, we demonstrate that an adaptive version of these heuristics, which applies them iteratively, can effectively capture the interactions among samples and thus partially address the issues. Experiments on real-world datasets corroborate these theoretical findings and further demonstrate that the merit of adaptivity can extend to more complex scenarios such as classification tasks and non-linear neural networks. We conclude our analysis by emphasizing the inherent trade-off between performance and computational efficiency, questioning the use of additive metrics such as the Linear Datamodeling Score, and offering a range of discussions.
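To make the one-shot vs. adaptive distinction concrete, here is a minimal, hypothetical sketch (not the authors' implementation) for linear regression: the one-shot greedy heuristic scores every sample once and takes the top-k, implicitly assuming influences add up, while the adaptive variant removes one sample at a time and re-scores, so later picks can react to interactions with earlier ones. Exact leave-one-out refitting stands in here for the influence-function approximation; the target of interest is assumed to be the prediction at a fixed test point.

```python
import numpy as np

def fit(X, y):
    # Ordinary least squares via numpy's least-squares solver.
    return np.linalg.lstsq(X, y, rcond=None)[0]

def target(theta, x_test):
    # Scalar quantity being attributed: the prediction at one test point.
    return x_test @ theta

def influences(X, y, idx, x_test, base):
    # Score each remaining sample by the change in the target when that
    # single sample is removed (exact leave-one-out refitting, standing
    # in for the influence-function approximation).
    scores = {}
    for i in idx:
        rest = [j for j in idx if j != i]
        scores[i] = target(fit(X[rest], y[rest]), x_test) - base
    return scores

def greedy_oneshot(X, y, x_test, k):
    # One-shot greedy: score every sample once, take the top-k.
    # Implicitly assumes the collective influence is additive.
    idx = list(range(len(y)))
    base = target(fit(X, y), x_test)
    s = influences(X, y, idx, x_test, base)
    return sorted(idx, key=lambda i: s[i], reverse=True)[:k]

def greedy_adaptive(X, y, x_test, k):
    # Adaptive greedy: remove one sample at a time and re-score on the
    # remaining data, capturing interactions among the removed samples.
    idx = list(range(len(y)))
    picked = []
    for _ in range(k):
        base = target(fit(X[idx], y[idx]), x_test)
        s = influences(X, y, idx, x_test, base)
        best = max(idx, key=lambda i: s[i])
        picked.append(best)
        idx.remove(best)
    return picked
```

The adaptive variant costs roughly k times as many refits as the one-shot version, which is the accuracy-versus-efficiency trade-off the paper emphasizes.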
Problem

Research questions and friction points this paper is trying to address.

Training Sample Influence
Machine Learning Models
Complex Nonlinear Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Improved Algorithm
Complex Influence Handling
Efficiency Optimization
Yuzheng Hu
Department of Computer Science, University of Illinois Urbana-Champaign
Pingbang Hu
University of Illinois Urbana-Champaign
Han Zhao
Department of Computer Science, University of Illinois Urbana-Champaign
Jiaqi W. Ma
Assistant Professor, University of Illinois Urbana-Champaign
Data-Centric AI · Data Attribution · Training Data Curation