PARIS: Pruning Algorithm via the Representer theorem for Imbalanced Scenarios

📅 2025-12-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: Empirical Risk Minimization (ERM) overfits to high-density regions in imbalanced regression, severely degrading predictive performance on rare tail events. Method: This paper proposes a dataset optimization approach grounded in the representer theorem for neural networks: it derives a closed-form representer deletion residual that quantifies, without retraining, the exact change in validation loss caused by removing a single training point; pairs it with Cholesky rank-one downdating so that each removal is cheap; and iteratively prunes uninformative or performance-degrading samples. Results: On a real-world space weather example, the method preserves or improves overall RMSE while pruning up to 75% of the training set, outperforming re-weighting, synthetic oversampling, and boosting baselines. Core contributions: (i) a theory-driven data pruning framework; (ii) an interpretable, retraining-free measure of per-sample influence; and (iii) an efficient training-set compression procedure designed for tail-robust regression.

📝 Abstract
The challenge of imbalanced regression arises when standard Empirical Risk Minimization (ERM) biases models toward high-frequency regions of the data distribution, causing severe degradation on rare but high-impact "tail" events. Existing strategies such as loss re-weighting or synthetic oversampling often introduce noise, distort the underlying distribution, or add substantial algorithmic complexity. We introduce PARIS (Pruning Algorithm via the Representer theorem for Imbalanced Scenarios), a principled framework that mitigates imbalance by optimizing the training set itself. PARIS leverages the representer theorem for neural networks to compute a closed-form representer deletion residual, which quantifies the exact change in validation loss caused by removing a single training point without retraining. Combined with an efficient Cholesky rank-one downdating scheme, PARIS performs fast, iterative pruning that eliminates uninformative or performance-degrading samples. On a real-world space weather example, PARIS reduces the training set by up to 75% while preserving or improving overall RMSE, outperforming re-weighting, synthetic oversampling, and boosting baselines. Our results demonstrate that representer-guided dataset pruning is a powerful, interpretable, and computationally efficient approach to rare-event regression.
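To illustrate what a retraining-free deletion score can look like, here is a minimal sketch for ridge regression on fixed features. It is an analogy only: the abstract does not give the paper's neural-network formula, so the model, the function names (`ridge_fit`, `deletion_val_loss_change`), and the regularization strength `lam` are all assumptions made for this example. The sketch uses the Sherman-Morrison identity to obtain every leave-one-out weight vector from a single fit, then reports the resulting change in validation MSE for each training point.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Fit ridge regression; return the weights and the regularized Gram matrix."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y), A


def deletion_val_loss_change(X, y, X_val, y_val, lam):
    """Change in validation MSE from deleting each training point, without refitting.

    Illustrative ridge-regression analogue (an assumption, not the paper's formula).
    Entry i is (validation MSE after deleting point i) - (current validation MSE).
    """
    w, A = ridge_fit(X, y, lam)
    A_inv_Xt = np.linalg.solve(A, X.T)          # column i holds A^{-1} x_i
    h = np.einsum("ij,ji->i", X, A_inv_Xt)      # leverages h_i = x_i^T A^{-1} x_i
    resid = y - X @ w                           # training residuals
    # Sherman-Morrison: w_{-i} = w - A^{-1} x_i * resid_i / (1 - h_i)
    W_del = w[:, None] - A_inv_Xt * (resid / (1.0 - h))
    base = np.mean((X_val @ w - y_val) ** 2)
    loss_del = np.mean((X_val @ W_del - y_val[:, None]) ** 2, axis=0)
    return loss_del - base
```

A negative entry means that removing the corresponding point would lower validation error, which marks it as a candidate for pruning under this toy criterion.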
Problem

Research questions and friction points this paper is trying to address.

Addresses imbalanced regression by optimizing training set composition
Mitigates bias towards high-frequency data regions in ERM models
Improves rare-event prediction via efficient dataset pruning without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pruning training set using representer theorem residuals
Closed-form deletion residual without model retraining
Efficient Cholesky rank-one downdate for iterative pruning
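The rank-one downdate named in the last item can be sketched as follows. This is the standard textbook algorithm, not code from the paper; the function name `chol_downdate` is hypothetical, and which matrix PARIS actually factorizes is not stated in this summary. Given a lower-triangular factor `L` with `L @ L.T == A`, it returns the factor of `A - x x^T` in O(d^2) operations instead of the O(d^3) cost of refactorizing from scratch.

```python
import numpy as np


def chol_downdate(L, x):
    """Return lower-triangular L_new with L_new @ L_new.T == L @ L.T - outer(x, x).

    Standard rank-one Cholesky downdate (illustrative; name is hypothetical).
    Raises ValueError if the downdated matrix is not positive definite.
    """
    L = np.array(L, dtype=float)   # work on copies so the inputs stay untouched
    x = np.array(x, dtype=float)
    n = x.size
    for k in range(n):
        r2 = L[k, k] ** 2 - x[k] ** 2
        if r2 <= 0.0:
            raise ValueError("downdated matrix is not positive definite")
        r = np.sqrt(r2)
        c = r / L[k, k]
        s = x[k] / L[k, k]
        L[k, k] = r
        # Update the rest of column k, then carry the remainder of x forward.
        L[k + 1:, k] = (L[k + 1:, k] - s * x[k + 1:]) / c
        x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L
```

In a pruning loop over a Gram-type matrix such as `X.T @ X + lam * I` (an assumed setup), deleting training row `x_i` corresponds to `chol_downdate(L, x_i)`, so the factor can be kept current across many removals without ever being recomputed.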