🤖 AI Summary
In multi-task, multi-modal federated learning, clients are heterogeneous in data distributions, task objectives, and modalities, which poses significant challenges for personalized federated learning (PFL).
Method: This paper proposes TAP, a two-stage adaptive personalization framework. TAP provides the first server-side convergence analysis for the modality-task pair architecture; jointly improves generalization and personalization through a mismatched client-server architecture design combined with post-FL knowledge distillation; and enables dynamic model adaptation and cross-modality, cross-task knowledge transfer via an adaptive replacement mechanism and two-stage optimization.
Contribution/Results: Evaluated on multiple multi-task, multi-modal benchmarks, TAP consistently outperforms state-of-the-art PFL and multi-modal federated learning methods in both accuracy and robustness. The framework demonstrates strong scalability to heterogeneous client settings and achieves superior personalization without compromising global utility. The implementation is publicly available.
📝 Abstract
Federated Learning (FL), despite demonstrating impressive capabilities in training multiple models in a decentralized manner, has been shown to produce a final model that is not necessarily well-suited to each client's needs. While extensive work has studied how to create tailored personalized models, a direction known as Personalized Federated Learning (PFL), less attention has been given to personalization via fine-tuning of foundation models with multi-task and multi-modal properties. Moreover, the literature offers little understanding of how to fine-tune and personalize such models in a setting that is heterogeneous across clients not only in data, but also in tasks and modalities. To address this gap, we propose TAP (Two-Stage Adaptive Personalization), which (i) leverages mismatched model architectures between the clients and server to selectively conduct replacement operations when they benefit a client's local tasks, and (ii) engages in post-FL knowledge distillation to capture beneficial general knowledge without compromising personalization. We also introduce the first convergence analysis of the server model under its modality-task pair architecture, and demonstrate that as the number of modality-task pairs increases, the server model's ability to cater to all tasks suffers. Through extensive experiments, we demonstrate the effectiveness of our proposed algorithm across a variety of datasets and tasks against a multitude of baselines. Implementation code is publicly available at https://github.com/lee3296/TAP.
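The selective replacement idea in (i) can be illustrated with a minimal sketch: a client keeps a server-side module only when swapping it in lowers the client's local validation loss. Everything here is an illustrative assumption, not the paper's actual implementation: the module granularity, the module names (`encoder`, `head`), and the toy scalar "modules" and loss function stand in for real network components and local task objectives.

```python
def adaptive_replace(local_modules, server_modules, eval_loss):
    """Hypothetical sketch of per-module adaptive replacement:
    for each module, keep whichever of the local or server version
    yields the lower local validation loss."""
    chosen = dict(local_modules)
    for name, server_mod in server_modules.items():
        baseline = eval_loss(chosen)
        candidate = {**chosen, name: server_mod}
        if eval_loss(candidate) < baseline:  # replace only if it helps locally
            chosen = candidate
    return chosen

# Toy usage: "modules" are scalars, and the local loss is the squared
# distance to a client-specific target (both are illustrative stand-ins).
target = {"encoder": 0.9, "head": 0.2}
local = {"encoder": 0.5, "head": 0.25}
server = {"encoder": 0.85, "head": 0.7}
loss = lambda mods: sum((mods[k] - target[k]) ** 2 for k in target)

personalized = adaptive_replace(local, server, loss)
# Here the server encoder helps this client and is adopted,
# while the server head hurts and the local head is kept.
```

In the paper's second stage, such a personalized model would additionally distill general knowledge from the server model after FL training; that step is omitted from this sketch.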