🤖 AI Summary
Public datasets often contain low-quality or contaminated samples that can degrade the performance of downstream learners. This paper formalizes the problem of Learner-Agnostic Robust data Prefiltering (LARP): finding prefiltering procedures that minimize a worst-case loss over a pre-specified set of heterogeneous downstream learners under data contamination. The framework is first instantiated for scalar mean estimation with Huber estimators under the Huber contamination model, where a hardness result and an analysis of several natural prefiltering procedures show that learner-agnostic prefiltering incurs an inherent utility loss compared to prefiltering data for each learner individually. Extensive experiments on real-world image and tabular data confirm a statistically significant reduction in utility and trace its dependence on the problem parameters. Finally, a game-theoretic model of the trade-off between this utility drop and the cost of repeated learner-specific prefiltering shows that LARP becomes advantageous for large datasets, where the savings from a single shared prefiltering pass outweigh the bounded loss in statistical utility.
📝 Abstract
The widespread availability of large public datasets is a key factor behind the recent successes of statistical inference and machine learning methods. However, these datasets often contain some low-quality or contaminated data, to which many learning procedures are sensitive. Therefore, the question arises of whether and how public datasets should be prefiltered to facilitate accurate downstream learning. On a technical level, this requires the construction of principled data prefiltering methods which are learner-agnostic robust, in the sense of provably protecting a set of pre-specified downstream learners from corrupted data. In this work, we formalize the problem of Learner-Agnostic Robust data Prefiltering (LARP), which aims at finding prefiltering procedures that minimize a worst-case loss over a pre-specified set of learners. We first instantiate our framework in the context of scalar mean estimation with Huber estimators under the Huber data contamination model. We provide a hardness result on a specific problem instance and analyze several natural prefiltering procedures. Our theoretical results indicate that performing LARP on a heterogeneous set of learners leads to some loss in model performance compared to the alternative of prefiltering data for each learner/use-case individually. We explore the resulting utility loss and its dependence on the problem parameters via extensive experiments on real-world image and tabular data, observing a statistically significant reduction in utility. Finally, we model the trade-off between the utility drop and the cost of repeated (learner-specific) prefiltering within a game-theoretic framework and showcase the benefits of LARP for large datasets.
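To make the setting concrete, here is a minimal sketch (not the paper's method) of the Huber contamination model and a generic, learner-agnostic prefiltering step: data are drawn from a mixture of clean Gaussian samples and adversarial outliers, and a median/MAD-based filter discards gross outliers before any downstream estimator sees the data. The filter rule, threshold `k`, and outlier distribution are all illustrative assumptions.

```python
import random
import statistics

def huber_contaminated_sample(n, mu=0.0, eps=0.2, outlier=50.0, seed=0):
    """Draw n points from (1 - eps) * N(mu, 1) + eps * delta_{outlier},
    a simple instance of the Huber contamination model (eps is the
    contamination fraction; the point mass at `outlier` is illustrative)."""
    rng = random.Random(seed)
    return [outlier if rng.random() < eps else rng.gauss(mu, 1.0)
            for _ in range(n)]

def prefilter(data, k=3.0):
    """Generic prefiltering: keep points within k robust standard
    deviations (1.4826 * MAD) of the median. Learner-agnostic in the
    sense that it does not know which estimator runs downstream."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data) or 1.0
    return [x for x in data if abs(x - med) <= k * 1.4826 * mad]

data = huber_contaminated_sample(2000)
naive = statistics.fmean(data)                # sample mean: dragged toward the outliers
filtered = statistics.fmean(prefilter(data))  # mean after generic prefiltering
print(f"naive mean:    {naive:+.3f}")
print(f"filtered mean: {filtered:+.3f}")
```

With 20% contamination at 50, the naive mean lands near 10 while the filtered mean stays close to the true value 0. The paper's point is subtler: once the filtered dataset must serve a *set* of heterogeneous robust learners, even a well-chosen generic filter cannot match per-learner filtering, and some utility loss is unavoidable.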