🤖 AI Summary
Existing image style transfer methods—particularly those built upon CNN or Transformer backbones—suffer from high computational complexity and slow inference due to their reliance on global receptive field modeling. To address this, we propose SaMam, an efficient state space model (SSM)-based framework tailored for arbitrary style transfer. Our key contributions are threefold: (1) a novel style-aware Mamba encoder-decoder architecture; (2) a local enhancement module coupled with a zigzag spatial scanning strategy to mitigate intrinsic SSM limitations—including pixel forgetting, channel redundancy, and spatial discontinuity; and (3) a style-conditioned state space modeling mechanism. Experiments demonstrate that SaMam achieves superior qualitative and quantitative performance over current state-of-the-art methods, while maintaining O(N) linear time complexity. It simultaneously improves style fidelity, content preservation, and inference speed.
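The claimed O(N) complexity comes from the sequential nature of state space models: the sequence is processed in a single recurrent pass rather than with all-pairs attention. A minimal sketch (a scalar toy SSM, not the authors' code; the parameters A, B, C are illustrative assumptions) of that linear-time scan:

```python
def ssm_scan(x, A=0.9, B=0.5, C=1.0):
    """Toy scalar SSM: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t.

    One pass over the sequence -> O(N) in sequence length,
    in contrast to the O(N^2) pairwise cost of self-attention.
    """
    h, ys = 0.0, []
    for xt in x:       # single recurrent sweep
        h = A * h + B * xt
        ys.append(C * h)
    return ys
```

In a real Mamba block, A, B, and C are learned (and input-dependent), and the recurrence runs per channel over the flattened image tokens, but the linear scan structure is the same.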
📝 Abstract
A global effective receptive field plays a crucial role in image style transfer (ST) for obtaining high-quality stylized results. However, existing ST backbones (e.g., CNNs and Transformers) suffer from huge computational complexity to achieve global receptive fields. Recently, the State Space Model (SSM), especially its improved variant Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers an approach to resolve the above dilemma. In this paper, we develop a Mamba-based style transfer framework, termed SaMam. Specifically, a Mamba encoder is designed to efficiently extract content and style information. In addition, a style-aware Mamba decoder is developed to flexibly adapt to various styles. Moreover, to address the problems of local pixel forgetting, channel redundancy, and spatial discontinuity in existing SSMs, we introduce both a local enhancement module and a zigzag scan. Qualitative and quantitative results demonstrate that our SaMam outperforms state-of-the-art methods in terms of both accuracy and efficiency.
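The spatial discontinuity mentioned above arises when a 2D feature map is flattened in plain raster order: the last token of one row and the first token of the next are far apart in the image. One common remedy, and a plausible reading of the zigzag scan here (a hypothetical sketch, not the paper's exact implementation), is a boustrophedon ordering that keeps consecutive tokens spatially adjacent:

```python
def zigzag_scan(feat):
    """Flatten a 2D grid (list of rows) in zigzag order:
    even-indexed rows left-to-right, odd-indexed rows right-to-left,
    so each token in the 1D sequence stays a spatial neighbor of the next.
    """
    seq = []
    for i, row in enumerate(feat):
        seq.extend(row if i % 2 == 0 else row[::-1])
    return seq
```

For a 2x2 grid `[[1, 2], [3, 4]]` this yields `[1, 2, 4, 3]`: the jump from token 2 to token 4 crosses only one pixel vertically, whereas raster order would jump from the end of one row back to the start of the next.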