🤖 AI Summary
Existing text-driven image style transfer methods rely on a single text prompt, which limits fine-grained, interpretable control over multiple styles. This paper proposes a multi-prompt style interpolation framework that blends diverse artistic styles (such as Cubism, Impressionism, and cartoon) within a single image, with control over the spatial and semantic influence of each style. Key contributions include: (1) a Multi-Prompt Embedding Mixer with adaptive blending weights; (2) a Hierarchical Masked Directional Loss that refines region-specific style consistency; and (3) integration with the StyleMamba state-space model and cross-modal alignment optimization. Experiments and user studies report better style fidelity, text-image alignment, and artistic flexibility than single-prompt baselines and naive linear combinations of styles, while retaining the computational efficiency of the state-space formulation.
📝 Abstract
Text-driven image style transfer has seen remarkable progress with methods leveraging cross-modal embeddings for fast, high-quality stylization. However, most existing pipelines assume a single textual style prompt, limiting the range of artistic control and expressiveness. In this paper, we propose a novel multi-prompt style interpolation framework that extends the recently introduced StyleMamba approach. Our method supports blending or interpolating among multiple textual prompts (e.g., "cubism," "impressionism," and "cartoon"), allowing the creation of nuanced or hybrid artistic styles within a single image. We introduce a Multi-Prompt Embedding Mixer combined with Adaptive Blending Weights to enable fine-grained control over the spatial and semantic influence of each style. Further, we propose a Hierarchical Masked Directional Loss to refine region-specific style consistency. Experiments and user studies confirm that our approach outperforms single-prompt baselines and naive linear combinations of styles, achieving superior style fidelity, text-image alignment, and artistic flexibility, all while maintaining the computational efficiency offered by the state-space formulation.
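The abstract does not spell out how the Multi-Prompt Embedding Mixer computes its blend, so the following is only a minimal sketch of one plausible reading: each prompt embedding is unit-normalized, a set of logits is turned into blending weights with a softmax, and the weighted sum is renormalized. The function names (`softmax`, `l2_normalize`, `mix_prompt_embeddings`), the softmax parameterization, and the toy 3-dimensional embeddings are all assumptions made for illustration and are not taken from the paper. In a real pipeline the embeddings would come from a text encoder and the logits could be learned or predicted per region.

```python
import math


def softmax(logits):
    """Turn arbitrary real-valued logits into weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def mix_prompt_embeddings(embeddings, logits):
    """Blend several prompt embeddings into one style embedding.

    Hypothetical helper: `embeddings` is a list of equal-length vectors,
    one per style prompt; `logits` holds one unnormalized weight per prompt.
    Returns the unit-length blended embedding and the softmax weights.
    """
    if not embeddings or len(embeddings) != len(logits):
        raise ValueError("need one logit per embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must share one dimension")

    weights = softmax(logits)
    unit = [l2_normalize(e) for e in embeddings]
    mixed = [sum(w * e[i] for w, e in zip(weights, unit)) for i in range(dim)]
    return l2_normalize(mixed), weights


# Toy stand-ins for text-encoder outputs (assumed values, not real embeddings).
cubism = [1.0, 0.0, 0.0]
impressionism = [0.0, 1.0, 0.0]
cartoon = [0.0, 0.0, 1.0]

# Equal logits give each style the same influence.
mixed, weights = mix_prompt_embeddings(
    [cubism, impressionism, cartoon], [0.0, 0.0, 0.0]
)
```

Raising one logit relative to the others shifts the blended embedding toward that prompt, which is the sense in which such weights are "adaptive". A naive linear combination, the baseline the abstract compares against, would correspond to fixing the weights by hand instead of computing them.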