AI Summary
This work addresses the bias of conventional prototype selection methods toward majority classes under class imbalance, which often leaves minority classes poorly represented. To mitigate this, the authors formulate uniform-weight prototype selection as an optimal transport problem: prototypes are chosen to minimize the optimal transport distance between a uniformly weighted prototype distribution and the target distribution. Because the resulting objective is super-additive and hard to approximate, they reformulate the marginal constraints to obtain a partial optimal transport objective that is submodular, and they develop a greedy algorithm with a provable $(1-1/e)$ approximation guarantee. Empirically, the approach improves minority-class performance across multiple imbalanced classification benchmarks without degrading majority-class accuracy, and it yields consistent gains when applied to pretraining and fine-tuning of large language models under domain imbalance.
Abstract
Selecting prototypical examples from a source distribution to represent a target data distribution is a fundamental problem in machine learning. Existing subset selection methods often rely on implicit importance scores, which can be skewed towards majority classes and lead to low-quality prototypes for minority classes. We present UniPROT, a novel subset selection framework that minimizes the optimal transport (OT) distance between a uniformly weighted prototypical distribution and the target distribution. While intuitive, this formulation leads to a cardinality-constrained maximization of a super-additive objective, which is generally intractable to approximate efficiently. To address this, we propose a principled reformulation of the OT marginal constraints, yielding a partial optimal transport-based submodular objective. We prove that this reformulation enables a greedy algorithm with a $(1-1/e)$ approximation guarantee relative to the original super-additive maximization problem. Empirically, we show that enforcing uniform prototype weights in UniPROT consistently improves minority-class representation in imbalanced classification benchmarks without compromising majority-class accuracy. In both fine-tuning and pretraining regimes for large language models under domain imbalance, UniPROT enforces uniform source contributions, yielding robust performance gains. Our results establish UniPROT as a scalable, theoretically grounded solution for uniform-weighted prototype selection. Our code is publicly available on GitHub: https://github.com/efficiency-learning/UniPROT
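To make the greedy step concrete, the following is a minimal sketch of cardinality-constrained greedy maximization of a monotone submodular set function, the general setting in which the classical $(1-1/e)$ bound applies. It is not the UniPROT objective: the abstract does not spell out the partial OT formulation, so the sketch substitutes a standard facility-location function as a stand-in. The names `coverage` and `greedy_select`, the Gaussian similarity, and the synthetic data are all illustrative assumptions.

```python
import math
import random


def coverage(sim, selected):
    """Stand-in objective (facility location), NOT the paper's partial OT
    objective: each target point is credited with its highest similarity
    to any selected prototype. Monotone submodular when sim >= 0."""
    if not selected:
        return 0.0
    n_targets = len(sim[0])
    return sum(max(sim[i][j] for i in selected) for j in range(n_targets))


def greedy_select(sim, k):
    """Pick k candidate indices, each time adding the candidate with the
    largest marginal gain in the objective."""
    selected = []
    remaining = list(range(len(sim)))
    for _ in range(k):
        base = coverage(sim, selected)
        best = max(remaining,
                   key=lambda i: coverage(sim, selected + [i]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


# Synthetic 2-D source (candidate) and target points, for illustration only.
random.seed(0)
source = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(30)]
target = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]

# Assumed similarity: Gaussian kernel on squared Euclidean distance.
sim = [[math.exp(-((sx - tx) ** 2 + (sy - ty) ** 2)) for tx, ty in target]
       for sx, sy in source]

prototypes = greedy_select(sim, k=5)
print(prototypes)
```

In this sketch every selected prototype implicitly carries equal weight only in the sense that the output is an unweighted index set. Enforcing uniform mass per prototype through the transport plan, as the abstract describes, would require replacing `coverage` with the paper's partial OT objective.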