🤖 AI Summary
In multimodal federated learning, clients often hold different subsets of modalities and are also missing input features within each modality, which leaves their local representations misaligned. To address this, we propose an adaptive representation alignment framework guided by learnable embedding controls. The method introduces: (1) a missing-pattern-aware reconfigurable encoder that generates client-specific reconfiguration signals; (2) an aggregation strategy for those signals driven by the similarity of missing patterns across clients; and (3) a theoretical analysis with an explicit performance bound that matches the empirical observations. Evaluated on multiple federated multimodal benchmarks under severe data incompleteness, the approach achieves up to a 36.45% performance improvement. It aligns the globally aggregated representation with each client's local context, handling both inter-client modality heterogeneity and intra-modal feature incompleteness without requiring every client to have all modalities.
📝 Abstract
Multimodal federated learning in real-world settings often encounters incomplete and heterogeneous data across clients. This results in misaligned local feature representations that limit the effectiveness of model aggregation. Unlike prior work that assumes either differing modality sets without missing input features or a shared modality set with missing features across clients, we consider a more general and realistic setting where each client observes a different subset of modalities and might also have missing input features within each modality. To address the resulting misalignment in learned representations, we propose a new federated learning framework featuring locally adaptive representations based on learnable client-side embedding controls that encode each client's data-missing patterns.
These embeddings serve as reconfiguration signals that align the globally aggregated representation with each client's local context, enabling more effective use of shared information. Furthermore, the embedding controls can be algorithmically aggregated across clients with similar data-missing patterns, making the reconfiguration signals more robust when adapting the global representation. Empirical results on multiple federated multimodal benchmarks with diverse data-missing patterns across clients demonstrate the efficacy of the proposed method, which achieves up to a 36.45% performance improvement under severe data incompleteness. The method is also supported by a theoretical analysis with an explicit performance bound that matches our empirical observations. Our source code is available at https://github.com/nmduonggg/PEPSY
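To make the idea of aggregating embedding controls across clients with similar data-missing patterns concrete, the following is a minimal sketch rather than the paper's actual algorithm. It assumes each client's missing pattern is summarized as a binary mask over input features, measures pattern similarity with cosine similarity, and turns similarities into aggregation weights with a softmax. The function name `aggregate_embedding_controls`, the mask encoding, and the `temperature` parameter are all hypothetical choices made for illustration.

```python
import numpy as np


def aggregate_embedding_controls(masks, embeddings, temperature=0.1):
    """Similarity-weighted averaging of per-client embedding controls.

    Illustrative assumption, not the method from the paper.
    masks:      (K, D) binary array, 1 = feature observed, 0 = missing.
    embeddings: (K, E) array, one embedding control per client.
    Returns (aggregated, weights) with shapes (K, E) and (K, K).
    """
    masks = np.asarray(masks, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Cosine similarity between the clients' missing-pattern masks.
    norms = np.linalg.norm(masks, axis=1, keepdims=True)
    unit = masks / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T

    # Row-wise softmax: clients with similar patterns get larger weights.
    logits = similarity / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each client receives a weighted average of all embedding controls.
    return weights @ embeddings, weights
```

With a small temperature, each client's aggregated control is dominated by clients whose missing patterns closely match its own, while a large temperature approaches plain averaging over all clients.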