🤖 AI Summary
Multimodal federated learning faces significant challenges from client-side modality heterogeneity and inconsistent model architectures, which hinder feature alignment, inflate communication costs, and compromise robustness. To address these issues, this work proposes CoMFed, a novel framework that introduces, for the first time, a latent-space consensus mechanism. CoMFed employs learnable projection matrices to generate compact cross-modal latent representations and incorporates a latent-space regularization term to align representations across clients. This approach preserves data privacy, substantially reduces both communication and computational overhead, and enhances robustness against outliers. Experimental results demonstrate that CoMFed achieves competitive accuracy on human activity recognition benchmarks.
📝 Abstract
Federated learning (FL) enables collaborative model training across distributed devices without sharing raw data, but applying FL to multi-modal settings introduces significant challenges. Clients typically possess heterogeneous modalities and model architectures, making it difficult to align feature spaces efficiently while preserving privacy and minimizing communication costs. To address this, we introduce CoMFed, a Communication-Efficient Multi-Modal Federated Learning framework that uses learnable projection matrices to generate compressed latent representations. A latent-space regularizer aligns these representations across clients, improving cross-modal consistency and robustness to outliers. Experiments on human activity recognition benchmarks show that CoMFed achieves competitive accuracy with minimal overhead.
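The mechanism described above can be sketched in a few lines: each client projects its own modality features into a small shared latent space via a learnable matrix, and a regularizer penalizes the distance between a client's mean latent representation and a shared consensus vector. This is a minimal illustrative sketch only; the dimensions, the consensus rule, and the penalty form are assumptions for illustration, not details taken from the CoMFed paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared latent dimension (assumption, not from the paper).
LATENT_DIM = 8

def project(features, W):
    """Compress modality features into the shared latent space via a
    learnable projection matrix W (here a fixed random stand-in)."""
    return features @ W

# Two clients with heterogeneous modalities, e.g. a 6-dim accelerometer
# feature vector vs. a 12-dim fused gyroscope/audio feature vector.
x_client_a = rng.normal(size=(32, 6))    # batch of 32 samples, 6 features
x_client_b = rng.normal(size=(32, 12))   # batch of 32 samples, 12 features

W_a = rng.normal(size=(6, LATENT_DIM)) / np.sqrt(6)
W_b = rng.normal(size=(12, LATENT_DIM)) / np.sqrt(12)

z_a = project(x_client_a, W_a)           # shape (32, LATENT_DIM)
z_b = project(x_client_b, W_b)           # shape (32, LATENT_DIM)

def alignment_penalty(z_local, z_consensus):
    """Latent-space regularizer: squared distance between a client's
    mean latent representation and a shared consensus vector."""
    return float(np.sum((z_local.mean(axis=0) - z_consensus) ** 2))

# One plausible consensus rule: average the client latent means. Only
# these compact LATENT_DIM-sized vectors would need to be communicated,
# never the raw modality data.
consensus = 0.5 * (z_a.mean(axis=0) + z_b.mean(axis=0))
loss_reg_a = alignment_penalty(z_a, consensus)
loss_reg_b = alignment_penalty(z_b, consensus)
```

The communication saving in this sketch comes from exchanging only the `LATENT_DIM`-sized summaries rather than raw features or full model weights; the actual quantities CoMFed exchanges may differ.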