🤖 AI Summary
This work investigates scalability bottlenecks in centralized distributed optimization, particularly federated learning, under computational and communication time constraints. It analyzes the asymptotic dependence on the number of workers $n$ of the server-side communication runtime and the variance-dependent runtime when optimizing an $L$-smooth, $d$-dimensional nonconvex objective using unbiased randomized sparsification compressors. A novel lower-bound construction technique is introduced, and under the homogeneous-data assumption a rigorous proof establishes that, regardless of compressor design, the server communication term and the variance term can improve at best polylogarithmically in $n$ (e.g., by a factor of $O(\log^2 n)$), precluding linear, or even polynomial, speedup. This result provides the first fundamental theoretical limit on scalability for the unbiased sparsification paradigm, delivering critical negative guidance for the design of distributed optimization algorithms.
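The $1/n$ variance reduction that motivates distributed training in the first place, and that the lower bound above shows cannot be preserved once server-side compression costs are counted, can be checked numerically. The snippet below is an illustrative sketch (the gradient model and all names are assumptions, not the paper's construction): averaging $n$ unbiased stochastic gradients divides the mean squared error by $n$.

```python
import numpy as np

# Averaging n unbiased stochastic gradients divides the variance by n;
# this is the source of the 1/n factor in distributed SGD's runtime term.
rng = np.random.default_rng(1)
true_grad = np.array([1.0, -2.0, 0.5])
sigma = 1.0  # per-coordinate noise level (illustrative)

def stoch_grad():
    # Unbiased stochastic gradient: true gradient plus zero-mean noise.
    return true_grad + sigma * rng.standard_normal(true_grad.shape)

def avg_grad(n):
    # The server's averaged gradient from n workers.
    return np.mean([stoch_grad() for _ in range(n)], axis=0)

def mse(n, trials=5000):
    # Empirical mean squared error of the n-worker average.
    return np.mean([np.sum((avg_grad(n) - true_grad) ** 2)
                    for _ in range(trials)])

m1, m8 = mse(1), mse(8)
print(m1 / m8)  # close to 8: variance shrinks linearly in n
```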
📝 Abstract
We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $\sigma^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $\tau_{s}$ and $\tau_{w}$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{n \varepsilon^2}$, which improves with the number of workers $n$, where $\Delta = f(x^0) - f^*$ and $x^0 \in \mathbb{R}^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term. However, once we account for the communication from the server to the workers, $\tau_{s}$, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term $\tau_{s} d \frac{L \Delta}{\varepsilon}$ and the variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{\varepsilon^2}$ better than poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution. To establish this result, we construct a new "worst-case" function and develop a new lower-bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous assumption.
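As a concrete illustration of the compressor class the lower bound targets, here is a minimal sketch of the classical Rand-$K$ unbiased random sparsifier: keep $K$ of the $d$ coordinates uniformly at random and rescale by $d/K$, so that the compressed vector equals the input in expectation. This is a standard example of the paradigm, not the paper's worst-case construction; the function name and parameters are assumptions for illustration.

```python
import numpy as np

def rand_k(x, k, rng):
    """Rand-K sparsifier: keep k coordinates chosen uniformly at random,
    scaled by d/k. The scaling makes the compressor unbiased: E[C(x)] = x."""
    d = x.shape[0]
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = (d / k) * x[idx]
    return out

# Empirical check of unbiasedness: average many compressions of a fixed vector.
rng = np.random.default_rng(0)
d, k = 20, 5
x = rng.standard_normal(d)
avg = np.mean([rand_k(x, k, rng) for _ in range(20000)], axis=0)
print(np.max(np.abs(avg - x)))  # small: the average recovers x
```

The rescaling by $d/k$ trades sparsity for variance: each compressed message is $4\times$ sparser here, but the per-message noise grows, which is exactly the tension between the communication and variance runtime terms discussed above.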