Diffusion-Guided Semantic Consistency for Multimodal Heterogeneity

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes SemanticFL, a novel federated learning framework designed to address semantic inconsistency arising from non-IID multimodal data. SemanticFL is the first to leverage the rich hierarchical semantic priors embedded in pretrained diffusion models—such as Stable Diffusion—by exploiting their VAE latent space and U-Net multi-level features to construct a shared semantic representation across heterogeneous clients. The framework further integrates server-side computational offloading, cross-modal contrastive learning, and a privacy-preserving semantic guidance mechanism to enable effective semantic alignment and stable optimization. Extensive experiments on CIFAR-10, CIFAR-100, and TinyImageNet demonstrate that SemanticFL significantly outperforms existing approaches, achieving up to a 5.49% accuracy improvement over FedAvg and effectively bridging the multimodal semantic gap.
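For context, the FedAvg baseline that SemanticFL is measured against (the "up to 5.49% accuracy improvement") simply averages client model parameters on the server, weighted by local dataset size. A minimal sketch (function name and array layout are illustrative, not from the paper):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: the server averages client parameter
    vectors, weighting each client by its local dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()  # per-client aggregation weights
    return sum(c * w for c, w in zip(coeffs, client_weights))
```

Under non-IID data, this plain average pulls the global model toward larger or unrepresentative clients, which is the degradation SemanticFL's semantic guidance is designed to counteract.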

📝 Abstract
Federated learning (FL) is severely challenged by non-independent and identically distributed (non-IID) client data, a problem that degrades global model performance, especially in multimodal perception settings. Conventional methods often fail to address the underlying semantic discrepancies between clients, leading to suboptimal performance for multimedia systems requiring robust perception. To overcome this, we introduce SemanticFL, a novel framework that leverages the rich semantic representations of pre-trained diffusion models to provide privacy-preserving guidance for local training. Specifically, our approach extracts multi-layer semantic representations from a pre-trained Stable Diffusion model (VAE-encoded latents and hierarchical U-Net features) to create a shared latent space that aligns heterogeneous clients, supported by an efficient client-server architecture that offloads heavy computation to the server. A unified consistency mechanism, employing cross-modal contrastive learning, further stabilizes convergence. We conduct extensive experiments on benchmarks including CIFAR-10, CIFAR-100, and TinyImageNet under diverse heterogeneity scenarios. Our results demonstrate that SemanticFL surpasses existing federated learning approaches, achieving accuracy gains of up to 5.49% over FedAvg and validating its effectiveness in learning robust representations for heterogeneous multimodal perception tasks.
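The abstract does not spell out the cross-modal contrastive consistency objective. A minimal InfoNCE-style sketch, assuming each client feature is matched index-wise to a server-provided diffusion guidance feature (the function name, temperature value, and feature shapes are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def contrastive_consistency_loss(client_feats, guide_feats, temperature=0.1):
    """InfoNCE-style consistency: pull each client feature toward its
    matching diffusion-derived guidance feature (the positive on the
    diagonal) and push it away from the other guidance features."""
    # L2-normalize both feature sets so similarities are cosine-based
    c = client_feats / np.linalg.norm(client_feats, axis=1, keepdims=True)
    g = guide_feats / np.linalg.norm(guide_feats, axis=1, keepdims=True)
    logits = c @ g.T / temperature                 # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)   # softmax over guides
    # positives sit on the diagonal: i-th client feature <-> i-th guide
    return float(-np.mean(np.log(probs[np.diag_indices(len(c))])))
```

The loss is low when client features already align with their guidance targets and grows as they drift, which is one plausible way a shared semantic space could stabilize optimization across heterogeneous clients.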
Problem

Research questions and friction points this paper is trying to address.

federated learning
non-IID
multimodal heterogeneity
semantic discrepancy
perception tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Models
Federated Learning
Semantic Consistency
Multimodal Heterogeneity
Contrastive Learning