🤖 AI Summary
Existing attention-based arbitrary style transfer methods (CNN-, Transformer-, or Diffusion-based) suffer from region-level style misalignment when content and style images share semantic consistency, primarily due to neglecting the coupling between local texture patterns and semantic regions. Method: We propose a plug-and-play Semantic Continuous–Sparse Attention (SCSA) mechanism that explicitly decouples modeling into two complementary components: (i) continuous holistic style representation within semantically coherent regions—guided by semantic segmentation and realized via continuous attention; and (ii) sparse local texture representation—obtained through semantic-constrained sparse similarity retrieval. These components are jointly optimized to enforce semantic alignment while preserving fine-grained texture fidelity. Results: On multiple benchmarks, our method achieves a 12.6% reduction in FID and a 23.4% improvement in semantic style matching accuracy, significantly outperforming diverse state-of-the-art baselines.
📝 Abstract
Attention-based arbitrary style transfer methods, including CNN-based, Transformer-based, and Diffusion-based, have flourished and produced high-quality stylized images. However, they perform poorly on the content and style images with the same semantics, i.e., the style of the corresponding semantic region of the generated stylized image is inconsistent with that of the style image. We argue that the root cause lies in their failure to consider the relationship between local regions and semantic regions. To address this issue, we propose a plug-and-play semantic continuous-sparse attention, dubbed SCSA, for arbitrary semantic style transfer -- each query point considers certain key points in the corresponding semantic region. Specifically, semantic continuous attention ensures each query point fully attends to all the continuous key points in the same semantic region that reflect the overall style characteristics of that region; Semantic sparse attention allows each query point to focus on the most similar sparse key point in the same semantic region that exhibits the specific stylistic texture of that region. By combining the two modules, the resulting SCSA aligns the overall style of the corresponding semantic regions while transferring the vivid textures of these regions. Qualitative and quantitative results prove that SCSA enables attention-based arbitrary style transfer methods to produce high-quality semantic stylized images.