🤖 AI Summary
This work addresses the statistical dependence among samples induced by data-adaptive sampling in pairwise learning tasks—such as ranking, metric learning, and AUC maximization—which invalidates conventional generalization analysis. To tackle this challenge, we unify two algorithm-dependent frameworks, algorithmic stability and PAC-Bayes theory, and establish the first generalization bound that holds for arbitrary data-adaptive pairwise sampling strategies under both smooth and non-smooth convex losses. Our framework applies broadly to pairwise stochastic gradient descent (PSGD) and pairwise stochastic gradient descent-ascent (PSGDA), delivering the first generalization guarantees for non-uniform pairwise sampling and thereby extending the theory well beyond prior uniform-sampling assumptions. Moreover, our analysis provides rigorous theoretical foundations for practical scenarios involving adaptive sampling, including adversarial training. The resulting bounds are both principled and broadly applicable, bridging a critical gap between theory and practice in pairwise learning.
📝 Abstract
We study stochastic optimization with data-adaptive sampling schemes for training pairwise learning models. Pairwise learning is ubiquitous, covering several popular learning tasks such as ranking, metric learning, and AUC maximization. A notable difference between pairwise and pointwise learning is the statistical dependence among input pairs, which existing analyses have not been able to handle in the general setting considered in this paper. To this end, we extend recent results that blend together two algorithm-dependent frameworks of analysis -- algorithmic stability and PAC-Bayes -- which allow us to deal with any data-adaptive sampling scheme in the optimizer. We instantiate this framework to analyze (1) pairwise stochastic gradient descent, a default workhorse in many machine learning problems, and (2) pairwise stochastic gradient descent-ascent, a method used in adversarial training. Both algorithms draw sample indices from a discrete distribution before each update. Non-uniform sampling of these indices has already been suggested in the recent literature, for which our work provides generalization guarantees in both smooth and non-smooth convex problems.
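To make the setting concrete, the following is a minimal sketch of pairwise SGD with a data-adaptive index distribution. It is illustrative only, not the paper's algorithm: the pairwise logistic loss, the learning rate, and the particular loss-proportional reweighting rule are all assumptions chosen for the example; the paper's analysis covers any such adaptive scheme.

```python
import numpy as np

def pairwise_sgd(X, y, steps=2000, lr=0.01, rng=None):
    """Illustrative pairwise SGD with data-adaptive index sampling.

    At each step, draw a pair of indices (i, j) from a discrete
    distribution over examples, take a gradient step on a pairwise
    loss for that pair, then adapt the distribution. The statistical
    dependence among updates comes from this adaptive sampling.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    probs = np.full(n, 1.0 / n)  # start uniform; adapted below

    for _ in range(steps):
        # sample a pair of distinct indices from the current distribution
        i, j = rng.choice(n, size=2, replace=False, p=probs)
        diff = X[i] - X[j]
        # pairwise logistic loss: log(1 + exp(-(y_i - y_j) * <w, x_i - x_j>))
        margin = (y[i] - y[j]) * (w @ diff)
        w -= lr * (-(y[i] - y[j]) * diff / (1.0 + np.exp(margin)))
        # hypothetical adaptive rule: upweight indices from high-loss pairs
        loss = np.logaddexp(0.0, -margin)  # numerically stable log(1+e^{-m})
        probs[[i, j]] += 0.1 * loss
        probs /= probs.sum()  # renormalize to a valid distribution

    return w
```

Because `probs` depends on the pairs seen so far, later updates are statistically coupled to earlier ones; this is exactly the dependence structure that breaks conventional i.i.d.-based generalization arguments and that the stability/PAC-Bayes framework is built to handle.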