Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Semantic segmentation models trained on synthetic data suffer significant performance degradation in real-world scenarios due to domain shift, especially under adverse conditions where annotations are scarce. To address this, we propose CACTI and CACTIF, two diffusion-based, semantically consistent style transfer methods. CACTI applies class-wise adaptive instance normalization to align feature statistics per semantic class, while CACTIF additionally filters cross-attention maps by feature similarity to jointly preserve structural fidelity and stylistic realism. Both methods build on pre-trained diffusion models and require only a minimal number of target-domain images. Evaluated on the GTA5→Cityscapes and GTA5→ACDC cross-domain benchmarks, our approach lowers Fréchet Inception Distance (FID) while better preserving content, effectively bridging the synthetic-to-real domain gap.

📝 Abstract
Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models make it possible to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: https://github.com/echigot/cactif.
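To make the class-wise normalization idea concrete, here is a minimal NumPy sketch of per-class adaptive instance normalization: for each semantic class present in both images, the content features of that class are re-normalized to match the style features' statistics for the same class. The function name, shapes, and the simple per-class loop are illustrative assumptions, not the paper's actual implementation (which operates inside a diffusion model's feature space).

```python
import numpy as np

def class_wise_adain(content_feat, style_feat, content_mask, style_mask, eps=1e-5):
    """Illustrative class-wise AdaIN (not the paper's code).

    content_feat, style_feat: arrays of shape (C, H, W).
    content_mask, style_mask: integer class-id maps of shape (H, W).
    For each class shared by both masks, shift/scale the content
    features of that class to the style statistics of the same class.
    """
    out = content_feat.copy()
    shared = np.intersect1d(np.unique(content_mask), np.unique(style_mask))
    for cls in shared:
        c_sel = content_mask == cls                 # content pixels of this class
        s_sel = style_mask == cls                   # style pixels of this class
        c = content_feat[:, c_sel]                  # (C, Nc)
        s = style_feat[:, s_sel]                    # (C, Ns)
        c_mu, c_std = c.mean(1, keepdims=True), c.std(1, keepdims=True) + eps
        s_mu, s_std = s.mean(1, keepdims=True), s.std(1, keepdims=True) + eps
        out[:, c_sel] = (c - c_mu) / c_std * s_std + s_mu
    return out
```

Classes absent from either image are left untouched, which is one plausible way to avoid transferring statistics across mismatched regions.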
Problem

Research questions and friction points this paper is trying to address.

Bridges synthetic-to-real domain gap for semantic segmentation
Enhances style transfer with semantic consistency techniques
Improves vision models using diffusion models on synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI)
Selective attention Filtering (CACTIF) for cross-attention maps
Diffusion-based style transfer preserving semantic boundaries
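The attention-filtering idea above can be sketched as follows: given a cross-attention map and the query/key features behind it, attention weights between dissimilar features are suppressed and each row is renormalized, falling back to the original row when everything is filtered out. The function, the cosine-similarity criterion, and the threshold `tau` are assumptions for illustration; the paper's CACTIF operates on a diffusion model's internal cross-attention maps.

```python
import numpy as np

def filter_cross_attention(attn, content_feat, style_feat, tau=0.3):
    """Illustrative similarity-based attention filtering (not the paper's code).

    attn: row-normalized cross-attention map, shape (Nq, Nk).
    content_feat: query features, shape (Nq, C).
    style_feat: key features, shape (Nk, C).
    Zeroes attention where cosine similarity falls below tau, then
    renormalizes rows; rows with no surviving weight keep the original.
    """
    c = content_feat / (np.linalg.norm(content_feat, axis=1, keepdims=True) + 1e-8)
    s = style_feat / (np.linalg.norm(style_feat, axis=1, keepdims=True) + 1e-8)
    sim = c @ s.T                                  # cosine similarity, (Nq, Nk)
    masked = np.where(sim >= tau, attn, 0.0)       # drop weak correspondences
    z = masked.sum(axis=1, keepdims=True)
    return np.where(z > 0, masked / np.maximum(z, 1e-8), attn)
```

Keeping the original row as a fallback is one simple way to avoid producing degenerate (all-zero) attention in regions with no strong correspondence.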
Estelle Chigot
ISAE-Supaero, University of Toulouse, France; Airbus, France
Dennis G. Wilson
ISAE-Supaero, University of Toulouse, France
Meriem Ghrib
Airbus, France; ISAE-Supaero, University of Toulouse, France
Thomas Oberlin
ISAE-Supaero, University of Toulouse, France
Signal and image processing; machine learning