🤖 AI Summary
This paper investigates the fundamental sample complexity limits of active multi-distribution learning. Addressing the lack of optimality guarantees for existing algorithms and the looseness of current label complexity bounds, it establishes the first information-theoretically tight upper and lower bounds. Methodologically, it integrates VC dimension theory, the maximum disagreement coefficient, and multi-distribution error analysis to design both distribution-dependent and distribution-independent active querying strategies, and it introduces an instance-dependent passive bound bridging the realizable and agnostic settings. Theoretical contributions include: (1) a label complexity of $\widetilde{O}\bigl(\theta_{\max}(d+k)\ln\frac{1}{\varepsilon}\bigr)$ in the nearly realizable setting, together with a proof of its optimality; (2) a bound of $\widetilde{O}\bigl(\theta_{\max}(d+k)\bigl(\ln\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\bigr)+\frac{k\nu}{\varepsilon^2}\bigr)$ in the agnostic setting, revealing that the $\frac{k\nu}{\varepsilon^2}$ term, which arises from the approximation error $\nu$ of the best hypothesis, is intrinsic. These results provide foundational theoretical support for collaborative learning, fairness, and robustness.
📝 Abstract
Multi-distribution learning extends agnostic Probably Approximately Correct (PAC) learning to the setting in which a family of $k$ distributions, $\{D_i\}_{i\in[k]}$, is considered and a classifier's performance is measured by its error under the worst distribution. This problem has attracted significant recent interest due to its applications in collaborative learning, fairness, and robustness. Despite a rather complete picture of the sample complexity of passive multi-distribution learning, research on active multi-distribution learning remains scarce, and the optimality of existing algorithms remains unknown.
In this paper, we develop new algorithms for active multi-distribution learning and establish improved label complexity upper and lower bounds in both distribution-dependent and distribution-free settings. Specifically, we prove upper bounds of $\widetilde{O}\bigl(\theta_{\max}(d+k)\ln\frac{1}{\varepsilon}\bigr)$ and $\widetilde{O}\bigl(\theta_{\max}(d+k)\bigl(\ln\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\bigr)+\frac{k\nu}{\varepsilon^2}\bigr)$ in the nearly realizable and agnostic settings, respectively, where $\theta_{\max}$ is the maximum disagreement coefficient among the $k$ distributions, $d$ is the VC dimension of the hypothesis class, $\nu$ is the multi-distribution error of the best hypothesis, and $\varepsilon$ is the target excess error. Moreover, we show that the bound in the nearly realizable setting is information-theoretically optimal and that the $k\nu/\varepsilon^2$ term in the agnostic setting is fundamental for proper learners. We also establish an instance-dependent sample complexity bound for passive multi-distribution learning that smoothly interpolates between the realizable and agnostic regimes~\citep{blum2017collaborative,zhang2024optimal}, which may be of independent interest.
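To make the scaling of the two label complexity bounds concrete, the sketch below evaluates them numerically with the hidden constants and logarithmic factors of the $\widetilde{O}(\cdot)$ notation set to 1. This is an illustration of the stated formulas only, not an implementation of the paper's algorithms; the specific parameter values are hypothetical.

```python
import math

def realizable_bound(theta_max: float, d: int, k: int, eps: float) -> float:
    """O~(theta_max * (d + k) * ln(1/eps)), constants set to 1 for illustration."""
    return theta_max * (d + k) * math.log(1.0 / eps)

def agnostic_bound(theta_max: float, d: int, k: int, eps: float, nu: float) -> float:
    """O~(theta_max*(d+k)*(ln(1/eps) + nu^2/eps^2) + k*nu/eps^2)."""
    return (theta_max * (d + k) * (math.log(1.0 / eps) + nu ** 2 / eps ** 2)
            + k * nu / eps ** 2)

# Hypothetical instance: 5 distributions, VC dimension 10,
# disagreement coefficient 2, target excess error 1%.
print(realizable_bound(2.0, 10, 5, 0.01))
print(agnostic_bound(2.0, 10, 5, 0.01, nu=0.05))
```

Note that when $\nu = 0$ the agnostic expression collapses exactly to the realizable one, consistent with the interpolation between the two regimes discussed above; for $\nu > 0$, the $\nu^2/\varepsilon^2$ and $k\nu/\varepsilon^2$ terms dominate as $\varepsilon \to 0$.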