AI Summary
This work addresses the model instability and client drift that heterogeneous client data cause in federated learning by proposing a two-stage federated optimization framework. Training begins with a full-model warm-up phase; the query and key modules of the Transformer attention mechanism are then frozen, and only the value module continues to be optimized. The study is the first to reveal the distinct roles of the query/key and value modules in federated optimization, establishes a theoretical trade-off between warm-up length and the bias-drift dilemma, and introduces a novel paradigm based on freezing the attention kernel. The analysis combines linear-attention modeling, module decomposition, and kernel regularization. Simulations validate the theoretical predictions, and experiments on real-world heterogeneous data demonstrate improved training stability and model performance.
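To make the query/key versus value split concrete, the block below writes out one common single-head linear-attention form. It is a sketch under assumptions: the symbols $X$, $W_Q$, $W_K$, $W_V$, and $K(X)$ are introduced here for illustration, and the paper's own notation and normalization may differ.

```latex
% Assumed shapes: X \in \mathbb{R}^{n \times d},\;
% W_Q, W_K \in \mathbb{R}^{d \times d_k},\; W_V \in \mathbb{R}^{d \times d_v}.
\[
  \mathrm{Attn}(X)
  \;=\;
  \underbrace{\bigl( X W_Q W_K^{\top} X^{\top} \bigr)}_{K(X)\ \text{(attention kernel)}}
  \; X W_V .
\]
```

In this form the query/key block enters only through the kernel $K(X)$. Once $W_Q$ and $W_K$ are frozen, $K(X)$ is fixed for each input, and the output is linear in the remaining trainable block $W_V$.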
Abstract
Federated learning with heterogeneous clients remains a significant challenge for deep learning, primarily due to client drift arising from inconsistent local updates. Existing federated optimization methods typically address this issue through objective-level regularization or update-correction mechanisms. Recent studies, however, suggest that Transformer-based architectures may be inherently more robust than conventional models under heterogeneous federated training. Motivated by this observation, we investigate how different parameter components within the attention mechanism influence federated optimization. Specifically, we decompose the attention module into a query/key block, which determines the attention kernel, and a value block, which performs semantic transformation under the induced kernel. Based on this perspective, we propose FedFrozen, a two-stage federated optimization framework that first performs full-model warm-up training and then freezes the query/key block while continuing to optimize the value block. Under a linear-attention formulation, we show that the warm-up stage can be interpreted as an inexact descent procedure on a regularized kernel-profile objective, while the frozen stage reduces to a restricted value-block optimization problem under a fixed attention kernel. Our analysis further reveals an explicit trade-off that governs the choice of warm-up length. Simulations validate the predicted bias-drift behavior, and real-data experiments demonstrate that FedFrozen improves both the stability and effectiveness of Transformer models in heterogeneous federated learning.
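The two-stage schedule described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the block names (`query`, `key`, `value`), the functions `local_update`, `average`, and `fed_frozen`, the quadratic per-client objective, and the uniform FedAvg-style averaging are all assumptions made for the example.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]

# Hypothetical block names: the query/key block defines the attention kernel.
FROZEN_AFTER_WARMUP = ("query", "key")


def local_update(global_params: Params, client_target: Params,
                 trainable: Sequence[str], lr: float = 0.5) -> Params:
    """One toy local step: gradient descent on 0.5 * ||p - target||^2,
    applied only to the blocks listed in `trainable`."""
    local = {name: list(vals) for name, vals in global_params.items()}
    for name in trainable:
        local[name] = [p - lr * (p - t)
                       for p, t in zip(local[name], client_target[name])]
    return local


def average(client_params: List[Params]) -> Params:
    """Uniform (FedAvg-style) average over clients, block by block."""
    n = len(client_params)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in client_params))]
        for name in client_params[0]
    }


def fed_frozen(init_params: Params, client_targets: List[Params],
               warmup_rounds: int, total_rounds: int, lr: float = 0.5) -> Params:
    """Stage 1 (warm-up): every block is trained.
    Stage 2 (frozen): query/key stay fixed; only the other blocks are trained."""
    params = {name: list(vals) for name, vals in init_params.items()}
    all_blocks = tuple(params)
    for rnd in range(total_rounds):
        if rnd < warmup_rounds:
            trainable = all_blocks
        else:
            trainable = tuple(b for b in all_blocks if b not in FROZEN_AFTER_WARMUP)
        updates = [local_update(params, t, trainable, lr) for t in client_targets]
        params = average(updates)
    return params


if __name__ == "__main__":
    init = {"query": [0.0], "key": [0.0], "value": [0.0]}
    # Two heterogeneous clients whose local optima disagree.
    targets = [
        {"query": [1.0], "key": [1.0], "value": [1.0]},
        {"query": [3.0], "key": [3.0], "value": [3.0]},
    ]
    print(fed_frozen(init, targets, warmup_rounds=1, total_rounds=3))
```

In this toy setting a longer warm-up lets the query/key block move closer to the average of the client optima before it is frozen, while a shorter warm-up fixes it earlier. This only loosely mirrors the warm-up-length trade-off the abstract describes and does not model client drift itself.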