DiffStyler: Diffusion-based Localized Image Style Transfer

📅 2024-03-27
🏛️ arXiv.org
📈 Citations: 17
Influential: 0
🤖 AI Summary
To address the challenge of simultaneously preserving semantic fidelity and achieving precise style transfer in cross-image content-style co-migration, this paper proposes a diffusion-based, mask-guided local style transfer method. The approach introduces three key innovations: (1) the first use of LoRA modules to encapsulate style representations, leveraging their spatial consistency for mask-level localized style transfer; (2) a cross-LoRA feature fusion and attention injection mechanism enabling concurrent multi-style integration; and (3) a denoising fusion strategy guided by FastSAM-derived masks and mask-conditioned prompts. Comprehensive qualitative and quantitative evaluations demonstrate that the method achieves a superior balance between content preservation and style fidelity compared to prior state-of-the-art methods.
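The mask-guided denoising fusion described above can be sketched as two parallel denoising trajectories, one with the style LoRA active and one without, blended at every step by the FastSAM mask so that unmasked regions keep the original content. This is a minimal illustrative sketch, not the paper's implementation: the denoiser callables, flat latent lists, and binary masks are hypothetical stand-ins for a Stable Diffusion UNet with LoRA-injected weights.

```python
# Hedged sketch of mask-guided denoising fusion. All names and shapes
# are illustrative assumptions; latents are flat lists of floats and
# masks are per-element weights in [0, 1].

def fuse_step(z_styled, z_original, mask):
    """Blend two latents element-wise: styled values inside the mask,
    original values outside, so unmasked regions stay untouched."""
    return [m * s + (1 - m) * o
            for m, s, o in zip(mask, z_styled, z_original)]

def denoise_with_mask(z_init, mask, denoise_styled, denoise_original, steps):
    """Run two denoising trajectories (with and without the style LoRA,
    both assumed callables) and fuse them at every step via the mask."""
    z_s, z_o = list(z_init), list(z_init)
    for t in range(steps):
        z_s = denoise_styled(z_s, t)     # style-LoRA denoiser (assumed)
        z_o = denoise_original(z_o, t)   # base denoiser (assumed)
        z_s = fuse_step(z_s, z_o, mask)  # keep unmasked regions original
    return z_s
```

Fusing at every denoising step, rather than once at the end, keeps the two trajectories consistent so the masked and unmasked regions stay coherent in the final image.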

📝 Abstract
Image style transfer aims to imbue digital imagery with the distinctive attributes of style targets, such as colors, brushstrokes, and shapes, while concurrently preserving the semantic integrity of the content. Despite advancements in arbitrary style transfer methods, a prevalent challenge remains the delicate equilibrium between content semantics and style attributes. Recent developments in large-scale text-to-image diffusion models have heralded unprecedented synthesis capabilities, albeit at the expense of relying on extensive and often imprecise textual descriptions to delineate artistic styles. Addressing these limitations, this paper introduces DiffStyler, a novel approach that facilitates efficient and precise arbitrary image style transfer. At the core of DiffStyler lies the use of a LoRA, trained on a text-to-image Stable Diffusion model, to encapsulate the essence of style targets. This approach, coupled with strategic cross-LoRA feature and attention injection, guides the style transfer process. Our methodology is rooted in the observation that LoRA maintains the spatial feature consistency of the UNet, a discovery that inspired the development of a mask-wise style transfer technique. This technique employs masks extracted by a pre-trained FastSAM model, using mask prompts to facilitate feature fusion during the denoising process, thereby enabling localized style transfer that preserves the original image's unaffected regions. Moreover, our approach accommodates multiple style targets through the use of corresponding masks. Through extensive experimentation, we demonstrate that DiffStyler surpasses previous methods in achieving a more harmonious balance between content preservation and style integration.
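The cross-LoRA attention injection mentioned in the abstract can be illustrated with a toy single-head attention: queries from the content pass attend to keys and values taken from the style-LoRA pass, so the content layout (carried by the queries) is preserved while the attended features are stylized. This is a hedged sketch under assumed shapes, not DiffStyler's actual UNet attention; all function names here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                          for kj in k])
        out.append([sum(w * vj[i] for w, vj in zip(scores, v))
                    for i in range(len(v[0]))])
    return out

def cross_lora_injection(q_content, kv_style, kv_content, inject=True):
    """Content queries attend to keys/values from the style-LoRA pass
    (assumed) when injection is on, otherwise to the content's own."""
    k, v = kv_style if inject else kv_content
    return attention(q_content, k, v)
```

With `inject=False` this reduces to ordinary self-attention on the content features, which makes the injection an easily toggled modification of the denoising pass.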
Problem

Research questions and friction points this paper is trying to address.

Balances content semantics with style preservation in cross-domain image generation
Overcomes reliance on vague textual prompts for defining image styles
Enables flexible fusion of multiple models and styles through spatial and temporal combinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Customized models learn style representations
Cross-model feature and attention modulation
Spatial and temporal multi-model combinations
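The spatial multi-style combination listed above, where each style target is applied inside its own FastSAM mask while uncovered regions keep the original content, can be sketched as an iterated mask blend. This is an illustrative sketch with assumed flat-list latents and binary masks, not the paper's implementation.

```python
# Hedged sketch of spatial multi-style fusion: each style's latent
# fills its own mask region; elements covered by no mask keep the
# original latent. Shapes and names are illustrative assumptions.

def multi_style_fuse(z_styles, z_original, masks):
    """Apply each styled latent inside its corresponding mask,
    assuming one mask per style and non-overlapping mask regions."""
    out = list(z_original)
    for z_s, m in zip(z_styles, masks):
        out = [mi * s + (1 - mi) * o for mi, s, o in zip(m, z_s, out)]
    return out
```

Because each blend only overwrites its own mask region, later styles never disturb earlier ones as long as the masks do not overlap.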
Shaoxu Li
John Hopcroft Center for Computer Science, Shanghai Jiao Tong University