🤖 AI Summary
In conventional semantic segmentation, discrete embedding quantization leads to detail loss and degraded cross-domain robustness. This paper introduces the first diffusion-driven semantic generation framework based on continuous-value embeddings, formulating mask generation as a progressive image-to-continuous-embedding diffusion process. Key contributions include: (1) a diffusion-guided autoregressive Transformer that learns highly discriminative continuous semantic embedding spaces; (2) a unified architecture integrating KL-regularized VAE encoding, diffusion-based conditional generation, and decoding—enabling zero-shot cross-domain adaptation; and (3) joint modeling of feature representations and mask structural priors. Evaluated on Cityscapes and challenging domain-shift scenarios—including fog, snow, and viewpoint shifts—the framework achieves state-of-the-art robustness: AP ≈ 95% under Gaussian noise and motion blur, and AP ≈ 90% under salt-and-pepper noise and hue shift.
📝 Abstract
Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation mask using quantized embeddings (e.g. VQ-VAE) is 8% lower than continuous-valued embeddings (e.g. KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution includes a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework contains a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. Our setting facilitates zero-shot domain adaptation capabilities enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance ($approx$ 95% AP compared to baseline) under gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact ($approx$ 90% AP compared to baseline) from 50% salt and pepper noise, saturation and hue shifts. Code available: https://github.com/mahmed10/CAMSS.git