Sissi: Zero-shot Style-guided Image Synthesis via Semantic-style Integration

📅 2026-01-10
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving precise example-based stylization while preserving both semantic fidelity and stylistic authenticity, a balance often compromised by existing methods that rely on task-specific retraining or costly inverse mapping. The authors reformulate the problem as a zero-shot contextual learning task, leveraging a pre-trained ReFlow inpainting model by directly concatenating a reference style image with a masked target image to jointly embed semantic content and visual style. Central to their approach is the Dynamic Semantic-Style Integration (DSSI) mechanism, which adaptively reweights attention contributions from textual and visual guidance to mitigate multimodal conflicts. Requiring no additional training, the proposed method significantly outperforms current state-of-the-art techniques in both semantic-style balance and overall generation quality.

๐Ÿ“ Abstract
Text-guided image generation has advanced rapidly with large-scale diffusion models, yet achieving precise stylization with visual exemplars remains difficult. Existing approaches often depend on task-specific retraining or expensive inversion procedures, which can compromise content integrity, reduce style fidelity, and lead to an unsatisfactory trade-off between semantic prompt adherence and style alignment. In this work, we introduce a training-free framework that reformulates style-guided synthesis as an in-context learning task. Guided by textual semantic prompts, our method concatenates a reference style image with a masked target image, leveraging a pretrained ReFlow-based inpainting model to seamlessly integrate semantic content with the desired style through multimodal attention fusion. We further analyze the imbalance and noise sensitivity inherent in multimodal attention fusion and propose a Dynamic Semantic-Style Integration (DSSI) mechanism that reweights attention between textual semantic and style visual tokens, effectively resolving guidance conflicts and enhancing output coherence. Experiments show that our approach achieves high-fidelity stylization with superior semantic-style balance and visual quality, offering a simple yet powerful alternative to complex, artifact-prone prior methods.
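The abstract describes DSSI as reweighting attention between textual semantic tokens and style visual tokens inside a multimodal attention layer. The paper's exact reweighting rule is not given here, so the following is only a minimal sketch of the general idea: attention logits from the text branch and the style-image branch are scaled by per-branch weights before a shared softmax, so shifting the weights shifts guidance between semantics and style. The function name `dssi_attention` and the scalar weights `w_text` / `w_style` are illustrative assumptions, not the authors' implementation (which computes the weights adaptively).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dssi_attention(q, k_text, v_text, k_style, v_style,
                   w_text=1.0, w_style=1.0):
    """Joint attention over text and style tokens with branch reweighting.

    Hypothetical sketch: each branch's attention logits are offset by
    log(w) before a softmax over the concatenated token set, so raising
    w_style moves probability mass toward the style-image tokens.
    """
    d = q.shape[-1]
    logits_text = (q @ k_text.T) / np.sqrt(d) + np.log(w_text)
    logits_style = (q @ k_style.T) / np.sqrt(d) + np.log(w_style)
    # One softmax over both branches: the branches compete for mass.
    probs = softmax(np.concatenate([logits_text, logits_style], axis=-1))
    values = np.concatenate([v_text, v_style], axis=0)
    return probs @ values
```

Scaling logits rather than post-softmax outputs keeps the result a proper convex combination of values, which is why a single shared softmax is used here instead of blending two separate attention maps.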
Problem

Research questions and friction points this paper is trying to address.

style-guided image synthesis
zero-shot stylization
semantic-style alignment
text-to-image generation
visual exemplar
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot style transfer
semantic-style integration
multimodal attention fusion
in-context learning
training-free image synthesis
Yingying Deng
University of Science and Technology Beijing
computer vision, AIGC
Xiangyu He
Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China.
Fan Tang
School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, 100040, China.
Weiming Dong
Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China.
Xucheng Yin
Department of Computer Science and Technology, University of Science and Technology Beijing, Beijing, 100083, China.