🤖 AI Summary
This work addresses federated multi-label recognition, where client data heterogeneity often leads models to overfit spurious label correlations and erroneously activate irrelevant categories. To tackle this issue, the paper introduces FedMPT, which it presents as the first framework dedicated to this setting. The framework combines causal inference, large language model-driven discovery of the conditions that govern label dependencies, optimal transport-guided alignment between prompts and image patches, and a gated prediction fusion mechanism. Using front-door adjustment with intermediate variables that decouple the modeling process, together with prompt tuning enriched by generalizable conditions, the method aims to suppress incorrect label activations. Experiments on multiple benchmark datasets show that the approach achieves competitive results and outperforms state-of-the-art methods under varied heterogeneous federated settings.
📝 Abstract
Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, in realistic decentralized applications that require federated learning, adapting VLMs to clients that each hold private and heterogeneous data can cause the model to overfit spurious label correlations and, consequently, to trigger irrelevant categories on new samples. To tackle this problem, we revisit federated learning for MLR through a causal model, in which we adopt a front-door adjustment and decouple the MLR modeling process via intermediate variables that magnify the oracle label co-occurrence. Guided by this analysis, we propose FedMPT, the first method specifically designed for federated MLR. The core idea of FedMPT is to leverage generalizable conditions to steer federated MLR and mitigate erroneous label activations. To achieve this, FedMPT introduces a Large Language Model (LLM)-driven pipeline to decipher the underlying conditions that govern label dependencies. Furthermore, we introduce an optimal transport between the condition-enriched prompts and the image patches to uncover multiple region-level semantics. Finally, we generate synergistic predictions from different conditions with a crafted gating mechanism. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms state-of-the-art (SOTA) methods under varied settings.
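
To make the optimal transport step more concrete, the following is a minimal sketch of aligning prompt embeddings with image patch embeddings. The abstract does not specify the cost function, the marginals, or the solver, so everything here is an assumption for illustration: a cosine-distance cost, uniform marginals, and entropic regularization solved with Sinkhorn iterations. The function names `cosine_cost` and `sinkhorn` and the parameters `eps` and `n_iters` are hypothetical and not taken from FedMPT.

```python
import math


def cosine_cost(prompts, patches):
    """Cost matrix C[i][j] = 1 - cosine similarity (assumed cost, not from the paper)."""
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    cost = []
    for p in prompts:
        row = []
        for q in patches:
            dot = sum(a * b for a, b in zip(p, q))
            row.append(1.0 - dot / (norm(p) * norm(q)))
        cost.append(row)
    return cost


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic optimal transport with uniform marginals (assumed setup).

    Returns a plan T where T[i][j] is the mass moved from prompt i to patch j.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over prompts
    b = [1.0 / m] * m  # uniform mass over patches
    # Gibbs kernel: a lower cost gives a larger kernel entry.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Under these assumptions, each row of the returned plan acts as a soft assignment of one prompt over the image patches, which is one way a prompt could attend to its own region-level semantics. The actual formulation in FedMPT may differ.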