Generalization Bounds and Stopping Rules for Learning with Self-Selected Data

📅 2025-05-12

📈 Citations: 0

✨ Influential: 0

career value

214K/year

🤖 AI Summary

This paper addresses the challenge of guaranteeing generalization in self-selective data learning paradigms—including active learning, semi-supervised learning, and multi-armed bandits—where distributional assumptions are often unrealistic or unverifiable. Methodologically, it establishes the first distribution-free generalization bound framework, introducing a covering-number-based and Wasserstein-ambiguity-set-driven bound for “reciprocal learning,” and integrates algorithmic stability with finite-step convergence analysis to derive an *anytime-valid* early-stopping rule applicable at arbitrary iteration steps. Key contributions are: (1) the first universal generalization bound that makes no assumptions on the underlying data distribution; (2) a verifiable and deployable adaptive early-stopping mechanism; and (3) empirical validation in semi-supervised learning, demonstrating both tightness of the bound and practical utility of the stopping rule—enabling training termination with controllable error guarantees.

Technology Category

Application Category

📝 Abstract

Many learning paradigms self-select training data in light of previously learned parameters. Examples include active learning, semi-supervised learning, bandits, or boosting. Rodemann et al. (2024) unify them under the framework of"reciprocal learning". In this article, we address the question of how well these methods can generalize from their self-selected samples. In particular, we prove universal generalization bounds for reciprocal learning using covering numbers and Wasserstein ambiguity sets. Our results require no assumptions on the distribution of self-selected data, only verifiable conditions on the algorithms. We prove results for both convergent and finite iteration solutions. The latter are anytime valid, thereby giving rise to stopping rules for a practitioner seeking to guarantee the out-of-sample performance of their reciprocal learning algorithm. Finally, we illustrate our bounds and stopping rules for reciprocal learning's special case of semi-supervised learning.

Problem

Research questions and friction points this paper is trying to address.

Study generalization bounds for self-selected data learning

Develop stopping rules for reciprocal learning algorithms

Analyze performance in semi-supervised learning scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unify learning paradigms under reciprocal learning

Prove generalization bounds using covering numbers

Develop stopping rules for out-of-sample performance

🔎 Similar Papers

Sample Selection Bias in Machine Learning for Healthcare