🤖 AI Summary
Existing image style transfer methods struggle to precisely control local-region stylization, often causing unintended edits in non-target areas or degrading overall image quality. This paper formally defines and addresses the fine-grained, text-driven local style transfer problem for the first time. We propose an end-to-end localization-editing co-design framework built upon diffusion models: it incorporates an attention-guided region masking mechanism, cross-modal text-image alignment fine-tuning, and a localized editing loss to ensure semantic fidelity and spatial controllability. Our method effectively suppresses global style drift and structural distortion while preserving high visual fidelity. Quantitative evaluation shows significant improvements over state-of-the-art approaches in LPIPS and CLIP-Score; user studies further confirm superior perceptual quality and precise spatial control.
📝 Abstract
Text-conditioned style transfer enables users to communicate their desired artistic styles through text descriptions, offering a new and expressive means of achieving stylization. In this work, we evaluate text-conditioned image editing and style transfer techniques on their fine-grained understanding of user prompts for precise "local" style transfer. We find that current methods fail to accomplish localized style transfer effectively: they either fail to restrict the style transfer to the specified regions of the image, or they distort the content and structure of the input image. To this end, we develop an end-to-end pipeline for "local" style transfer tailored to align with users' intent. We further substantiate the effectiveness of our approach through quantitative and qualitative analysis. The project code is available at: https://github.com/silky1708/local-style-transfer.
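To make the attention-guided region masking idea mentioned above concrete, the sketch below shows one common way such a mask can be derived: threshold the cross-attention weights between image patches and the text token naming the target region. This is a minimal, hedged illustration using synthetic NumPy arrays, not the paper's actual implementation; the function name, array layout, and threshold are assumptions for demonstration only.

```python
import numpy as np

def attention_region_mask(cross_attn, token_idx, threshold=0.5):
    """Derive a binary spatial mask from a cross-attention map.

    cross_attn: (H, W, T) array of attention weights from image
    patches to T text tokens; token_idx selects the text token that
    names the region to be stylized. (Hypothetical helper, not the
    paper's API.)
    """
    attn = cross_attn[..., token_idx]
    # Min-max normalize to [0, 1] so a fixed threshold is meaningful.
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    # Keep only patches that attend strongly to the target token.
    return (attn >= threshold).astype(np.float32)

# Toy example: attention concentrated in the top-left quadrant.
attn = np.zeros((8, 8, 4))
attn[:4, :4, 2] = 1.0          # token 2 "fires" on a 4x4 region
mask = attention_region_mask(attn, token_idx=2)
print(int(mask.sum()))          # number of selected patches
```

In a real diffusion-based pipeline, such a mask would typically gate the edit, e.g. by blending stylized and original latents per pixel, so the stylization stays confined to the user-specified region.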