Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two challenges in federated learning with pre-trained vision-language models: optimization inconsistency across clients and over-specialization within clients, which arise from client heterogeneity and full-data local updates. To this end, the authors propose FedDTL, a framework that decouples the image and text encoders between the server and clients and introduces a modality alignment mechanism to keep global semantic updates consistent. FedDTL also employs a two-stage local fine-tuning strategy: an initial supervised fine-tuning phase for a rapid warm start, followed by a reinforcement learning phase to enhance generalization. This approach is the first to integrate decoupled encoder architectures with reinforcement-learning-based local fine-tuning in federated vision-language learning. It achieves an effective balance between global task adaptation and generalization across diverse data distributions, including label skew and feature shift, in both few-shot and full-data settings.
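The two-stage local schedule described above can be illustrated with a toy example. The sketch below is not FedDTL's implementation: it assumes a small linear softmax classifier in place of a VLM encoder, plain cross-entropy for the supervised stage, and a REINFORCE-style update with a 0/1 correctness reward and a fixed baseline for the reinforcement stage. All function names and hyperparameters are hypothetical, since the summary does not specify the reward or the RL algorithm.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(weights, x):
    """Class probabilities of a linear model; weights holds one vector per class."""
    return softmax([sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights])


def sft_step(weights, x, y, lr):
    """Stage 1 (assumed form): one cross-entropy gradient step on a labeled example."""
    probs = forward(weights, x)
    for c, w in enumerate(weights):
        grad = probs[c] - (1.0 if c == y else 0.0)
        for i in range(len(w)):
            w[i] -= lr * grad * x[i]


def rl_step(weights, x, y, lr, rng, baseline=0.5):
    """Stage 2 (assumed form): one REINFORCE step with a 0/1 correctness reward."""
    probs = forward(weights, x)
    action = rng.choices(range(len(probs)), weights=probs)[0]  # sample a label
    advantage = (1.0 if action == y else 0.0) - baseline
    for c, w in enumerate(weights):
        # d log pi(action) / d logit_c = 1[c == action] - probs[c]
        grad_logp = (1.0 if c == action else 0.0) - probs[c]
        for i in range(len(w)):
            w[i] += lr * advantage * grad_logp * x[i]


def local_finetune(weights, data, sft_epochs=5, rl_epochs=5, lr=0.5, seed=0):
    """Run the supervised warm start first, then the reinforcement stage."""
    rng = random.Random(seed)
    for _ in range(sft_epochs):
        for x, y in data:
            sft_step(weights, x, y, lr)
    for _ in range(rl_epochs):
        for x, y in data:
            rl_step(weights, x, y, lr, rng)
    return weights


def accuracy(weights, data):
    """Fraction of examples whose highest-probability class matches the label."""
    hits = 0
    for x, y in data:
        probs = forward(weights, x)
        hits += int(probs.index(max(probs)) == y)
    return hits / len(data)
```

The ordering is the point of the sketch: the supervised stage moves the model to a reasonable starting policy, so the sampled labels in the reinforcement stage are mostly informative rather than random.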

📝 Abstract

Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging the strong representations of VLMs, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic updates and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning strategy, in which a supervised fine-tuning stage enables a rapid and reliable warm start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.
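For context on the "simple parameter aggregation" that the abstract contrasts FedDTL against, the sketch below shows a sample-weighted parameter average in the style of FedAvg. The abstract does not name the aggregation rule used by prior methods, so treating it as FedAvg is an assumption, and the function name and flat-list parameter layout are hypothetical.

```python
def fedavg(client_params, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    Assumed baseline only: this is the generic FedAvg rule, not FedDTL's
    server-side update.
    """
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("client_sizes must sum to a positive number")
    averaged = [0.0] * len(client_params[0])
    for params, n in zip(client_params, client_sizes):
        for i, p in enumerate(params):
            averaged[i] += (n / total) * p
    return averaged


# Two clients with 1 and 3 local samples: the larger client dominates.
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```

Because each client contributes only its final parameters, nothing in this rule reconciles clients that have drifted toward different local optima, which is the inter-client inconsistency the abstract refers to.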

Problem

Research questions and friction points this paper is trying to address.

Federated Learning

Vision-Language Models

Optimization Inconsistency

Over-specialization

Generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Training

Federated Learning

Vision-Language Models