🤖 AI Summary
This work addresses the challenge of federated language modeling under heterogeneous communication bandwidths, where sharing raw data or high-precision gradients is infeasible. The authors propose Federated Probe-Logit Distillation (FPLD), a framework that enables efficient global model distillation by quantizing logit vectors over a shared probe set. Key contributions include the first bandwidth-dependent tight error bound, a multi-round residual quantization mechanism, and a closed-form optimal allocation strategy for heterogeneous bandwidths—termed the logarithmic tilt water-filling rule and its adaptive variant. Theoretically, the method achieves an error rate of Θ(K⁻¹·2⁻²ᴮ/ⱽ), with multi-round quantization provably outperforming single-round approaches. Experiments confirm that the proposed optimal allocation consistently surpasses uniform and inverse-weighted baselines under heterogeneous pruning settings.
📝 Abstract
In federated language modeling, $K$ nodes each hold $n$ samples but cannot pool data or exchange full-precision gradients or weights. We study the minimax rate at which a conditional distribution over $V$ tokens can be estimated when each node may upload at most $B$ bits per query in a public probe set. In federated probe-logit distillation (FPLD), each node transmits a scalar-quantized logit vector on the probe set, and an aggregator distills a global parametric student. Prior work (Dubey and Huo, 2026) establishes a high-probability KL rate $O(d/(Kn) + ρ\sqrt{V \log V / m} + K^{-1} \cdot 2^{-2B/V})$ plus optimization slack, with the bandwidth term in its trace-sharpened form. Whether this bandwidth-term rate is tight, and how the upper bound generalizes to heterogeneous per-node bandwidths, are left open.
We close both gaps. First, the dithered FPLD construction has a matching single-round lower bound $Ω(K^{-1} \cdot 2^{-2B/V})$ under non-degeneracy, pinning the bandwidth-axis rate at $Θ(K^{-1} \cdot 2^{-2B/V})$. $T$-round sequential refinement with nested/scaled residual quantizers achieves $O(K^{-1} \cdot 2^{-2TB/V})$; vanilla FPLD's $T$-independent bandwidth term is suboptimal for every $T > 1$. Second, we establish a heterogeneous-bandwidth upper bound for per-node budgets $B_i$, paired with a closed-form optimal allocation $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$, a log-tilted water-filling rule that is the per-node analogue of reverse water-filling for distortion-rate optimization. A plug-in adaptive variant estimates the weights from a short warm-up phase and attains $1 + O(\sqrt{\log(K/δ)/(m T_0)})$ relative suboptimality. Synthetic n-gram simulations confirm that empirical KL is bracketed by the upper and lower bounds and that the optimal allocation strictly dominates uniform and inverse-weighted baselines under heterogeneous clipping.