🤖 AI Summary
Public datasets often contain low-quality or contaminated samples that can degrade the performance of downstream learners. This paper formalizes the problem of Learner-Agnostic Robust data Prefiltering (LARP): finding prefiltering procedures that minimize a worst-case loss over a pre-specified set of heterogeneous downstream learners under data contamination. The framework is first instantiated for scalar mean estimation with Huber estimators under the Huber contamination model, where a hardness result and an analysis of several natural prefiltering procedures show that learner-agnostic prefiltering incurs an inherent utility loss compared to prefiltering data for each learner individually. Extensive experiments on real-world image and tabular data confirm a statistically significant reduction in utility and trace its dependence on the problem parameters. Finally, a game-theoretic model of the trade-off between this utility drop and the cost of repeated learner-specific prefiltering shows that LARP becomes advantageous for large datasets, where the savings from a single shared prefiltering pass outweigh the bounded loss in statistical utility.
📝 Abstract
The widespread availability of large public datasets is a key factor behind the recent successes of statistical inference and machine learning methods. However, these datasets often contain some low-quality or contaminated data, to which many learning procedures are sensitive. Therefore, the question arises of whether and how public datasets should be prefiltered to facilitate accurate downstream learning. On a technical level, this requires the construction of principled data prefiltering methods which are learner-agnostic robust, in the sense of provably protecting a set of pre-specified downstream learners from corrupted data. In this work, we formalize the problem of Learner-Agnostic Robust data Prefiltering (LARP), which aims at finding prefiltering procedures that minimize a worst-case loss over a pre-specified set of learners. We first instantiate our framework in the context of scalar mean estimation with Huber estimators under the Huber data contamination model. We provide a hardness result on a specific problem instance and analyze several natural prefiltering procedures. Our theoretical results indicate that performing LARP on a heterogeneous set of learners leads to some loss in model performance compared to the alternative of prefiltering data for each learner/use-case individually. We explore the resulting utility loss and its dependence on the problem parameters via extensive experiments on real-world image and tabular data, observing a statistically significant reduction in utility. Finally, we model the trade-off between the utility drop and the cost of repeated (learner-specific) prefiltering within a game-theoretic framework and showcase the benefits of LARP for large datasets.
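To make the setting concrete, here is a minimal sketch (not the paper's method) of the Huber contamination model and a generic, learner-agnostic prefiltering step: data are drawn from a mixture of clean Gaussian samples and adversarial outliers, and a median/MAD-based filter discards gross outliers before any downstream estimator sees the data. The filter rule, threshold `k`, and outlier distribution are all illustrative assumptions.

```python
import random
import statistics

def huber_contaminated_sample(n, mu=0.0, eps=0.2, outlier=50.0, seed=0):
    """Draw n points from (1 - eps) * N(mu, 1) + eps * delta_{outlier},
    a simple instance of the Huber contamination model (eps is the
    contamination fraction; the point mass at `outlier` is illustrative)."""
    rng = random.Random(seed)
    return [outlier if rng.random() < eps else rng.gauss(mu, 1.0)
            for _ in range(n)]

def prefilter(data, k=3.0):
    """Generic prefiltering: keep points within k robust standard
    deviations (1.4826 * MAD) of the median. Learner-agnostic in the
    sense that it does not know which estimator runs downstream."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data) or 1.0
    return [x for x in data if abs(x - med) <= k * 1.4826 * mad]

data = huber_contaminated_sample(2000)
naive = statistics.fmean(data)                # sample mean: dragged toward the outliers
filtered = statistics.fmean(prefilter(data))  # mean after generic prefiltering
print(f"naive mean:    {naive:+.3f}")
print(f"filtered mean: {filtered:+.3f}")
```

With 20% contamination at 50, the naive mean lands near 10 while the filtered mean stays close to the true value 0. The paper's point is subtler: once the filtered dataset must serve a *set* of heterogeneous robust learners, even a well-chosen generic filter cannot match per-learner filtering, and some utility loss is unavoidable.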