🤖 AI Summary
This paper investigates the fundamental sample complexity limits of active multi-distribution learning. Addressing the lack of optimality guarantees for existing algorithms and the looseness of current label complexity bounds, it establishes the first information-theoretically tight upper and lower bounds. Methodologically, it integrates VC dimension theory, the maximum disagreement coefficient, and multi-distribution error analysis to design both distribution-dependent and distribution-independent active querying strategies, and it introduces an instance-dependent passive bound bridging the realizable and agnostic settings. Theoretical contributions include: (1) a label complexity of $\widetilde{O}\bigl(\theta_{\max}(d+k)\ln\frac{1}{\varepsilon}\bigr)$ in the nearly realizable setting, together with a proof of its optimality; (2) a bound of $\widetilde{O}\bigl(\theta_{\max}(d+k)\bigl(\ln\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\bigr)+\frac{k\nu}{\varepsilon^2}\bigr)$ in the agnostic setting, revealing that the $\frac{k\nu}{\varepsilon^2}$ term, which arises from the approximation error $\nu$ of the best hypothesis, is intrinsic. These results provide foundational theoretical support for collaborative learning, fairness, and robustness.
📝 Abstract
Multi-distribution learning extends agnostic Probably Approximately Correct (PAC) learning to the setting in which a family of $k$ distributions, $\{D_i\}_{i\in[k]}$, is considered and a classifier's performance is measured by its error under the worst distribution. This problem has attracted significant recent interest due to its applications in collaborative learning, fairness, and robustness. Despite a rather complete picture of the sample complexity of passive multi-distribution learning, research on active multi-distribution learning remains scarce, and the optimality of existing algorithms remains unknown.
In this paper, we develop new algorithms for active multi-distribution learning and establish improved label complexity upper and lower bounds in both distribution-dependent and distribution-free settings. Specifically, we prove upper bounds of $\widetilde{O}\bigl(\theta_{\max}(d+k)\ln\frac{1}{\varepsilon}\bigr)$ and $\widetilde{O}\bigl(\theta_{\max}(d+k)\bigl(\ln\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\bigr)+\frac{k\nu}{\varepsilon^2}\bigr)$ in the nearly realizable and agnostic settings, respectively, where $\theta_{\max}$ is the maximum disagreement coefficient among the $k$ distributions, $d$ is the VC dimension of the hypothesis class, $\nu$ is the multi-distribution error of the best hypothesis, and $\varepsilon$ is the target excess error. Moreover, we show that the bound in the nearly realizable setting is information-theoretically optimal and that the $k\nu/\varepsilon^2$ term in the agnostic setting is fundamental for proper learners. We also establish an instance-dependent sample complexity bound for passive multi-distribution learning that smoothly interpolates between the realizable and agnostic regimes~\citep{blum2017collaborative,zhang2024optimal}, which may be of independent interest.
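To make the scaling of the two label complexity bounds concrete, the sketch below evaluates them numerically with the hidden constants and logarithmic factors of the $\widetilde{O}(\cdot)$ notation set to 1. This is an illustration of the stated formulas only, not an implementation of the paper's algorithms; the specific parameter values are hypothetical.

```python
import math

def realizable_bound(theta_max: float, d: int, k: int, eps: float) -> float:
    """O~(theta_max * (d + k) * ln(1/eps)), constants set to 1 for illustration."""
    return theta_max * (d + k) * math.log(1.0 / eps)

def agnostic_bound(theta_max: float, d: int, k: int, eps: float, nu: float) -> float:
    """O~(theta_max*(d+k)*(ln(1/eps) + nu^2/eps^2) + k*nu/eps^2)."""
    return (theta_max * (d + k) * (math.log(1.0 / eps) + nu ** 2 / eps ** 2)
            + k * nu / eps ** 2)

# Hypothetical instance: 5 distributions, VC dimension 10,
# disagreement coefficient 2, target excess error 1%.
print(realizable_bound(2.0, 10, 5, 0.01))
print(agnostic_bound(2.0, 10, 5, 0.01, nu=0.05))
```

Note that when $\nu = 0$ the agnostic expression collapses exactly to the realizable one, consistent with the interpolation between the two regimes discussed above; for $\nu > 0$, the $\nu^2/\varepsilon^2$ and $k\nu/\varepsilon^2$ terms dominate as $\varepsilon \to 0$.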