🤖 AI Summary
This work addresses the challenges of personalized style control and content-style disentanglement in image stylization without additional training. We propose an inference-time stylization method built upon a pretrained scale-wise autoregressive model, featuring a three-branch prompt-guidance mechanism that explicitly decouples content and style during inference. A step-level and attention-level intervention analysis reveals that early-to-mid generation stages play the dominant role in structuring content and encoding style; motivated by this, we introduce key-stage attention sharing and adaptive query sharing. Fine-grained collaborative control is achieved via step-level and attention-level interventions, joint prompt-feature injection, and query-similarity fusion. Experiments demonstrate that our approach matches fine-tuning methods in style fidelity and prompt alignment, achieves significantly faster inference, and exhibits strong cross-style generalization and deployment flexibility.
📝 Abstract
We present a training-free framework for style-personalized image generation that controls content and style information during inference using a scale-wise autoregressive model. Our method employs a three-path design (content, style, and generation), each path guided by a corresponding text prompt, enabling flexible and efficient control over image semantics without any additional training. A central contribution of this work is a step-wise and attention-wise intervention analysis. Through systematic prompt and feature injection, we find that early-to-middle generation steps play a pivotal role in shaping both content and style, and that query features predominantly encode content-specific information. Guided by these insights, we introduce two targeted mechanisms: Key Stage Attention Sharing, which aligns content and style during the semantically critical steps, and Adaptive Query Sharing, which reinforces content semantics in later steps through similarity-aware query blending. Extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.
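To make the "similarity-aware query blending" idea concrete, here is a minimal sketch of how Adaptive Query Sharing could look in PyTorch. The function name, the sigmoid-based weighting, and the temperature `tau` are illustrative assumptions, not the paper's exact formulation; the sketch only captures the stated principle that generation-branch queries are blended toward content-branch queries in proportion to their similarity.

```python
import torch
import torch.nn.functional as F


def adaptive_query_sharing(q_gen: torch.Tensor,
                           q_content: torch.Tensor,
                           tau: float = 1.0) -> torch.Tensor:
    """Hypothetical similarity-aware query blending.

    q_gen, q_content: attention query features of shape (batch, tokens, dim),
    from the generation and content paths respectively.
    """
    # Token-wise cosine similarity between the two sets of queries.
    sim = F.cosine_similarity(q_gen, q_content, dim=-1)   # (batch, tokens)
    # Map similarity to a blending weight in (0, 1); tau is an assumed
    # temperature controlling how sharply similarity drives the blend.
    alpha = torch.sigmoid(sim / tau).unsqueeze(-1)        # (batch, tokens, 1)
    # Convex blend: tokens whose generation queries already resemble the
    # content queries lean further toward the content path.
    return alpha * q_content + (1.0 - alpha) * q_gen
```

In a real pipeline this fusion would replace the generation path's queries inside self-attention at the later steps the analysis identifies, while keys and values are handled by the attention-sharing mechanism.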