🤖 AI Summary
Existing arbitrary style transfer methods face a fundamental trade-off: lightweight models produce low-fidelity outputs with prominent artifacts, whereas large models achieve higher visual quality but preserve content structure poorly and run slowly at inference. This paper proposes a lightweight framework that requires no fine-tuning and, for the first time, embeds learnable explicit style priors into the frozen CLIP and diffusion Transformer (DiT) feature spaces, enabling effective content-style disentanglement. The approach combines CLIP-aligned guidance, DiT feature distillation, plug-and-play style adapters, and a contrastive style reconstruction loss, which together support zero-shot generalization to unseen styles and cross-domain reuse. On MSCOCO→WikiArt, the method reduces FID by 37%, improves style similarity by 2.1×, and runs at 48 fps at 512×512 resolution, significantly outperforming AdaIN, StyleCLIP, and LDM-Stylize.
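The summary names a contrastive style reconstruction loss but does not give its form. As a rough illustration only, the sketch below shows a generic InfoNCE-style contrastive loss over style embeddings in plain Python. The function names, the cosine-similarity scoring, and the temperature of 0.1 are assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def contrastive_style_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (an assumed stand-in, not the paper's loss).

    anchor:    style embedding of the stylized output
    positive:  style embedding of the reference style image
    negatives: style embeddings of other, unrelated styles
    """
    # Similarity logits; index 0 holds the positive pair.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]

    # log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))

    # -log softmax probability assigned to the positive pair.
    return log_denominator - logits[0]


# Toy embeddings: the loss is small when the anchor matches the positive
# style and large when it matches a negative style instead.
loss_matched = contrastive_style_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_mismatched = contrastive_style_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In a real pipeline the embeddings would come from a style encoder (for instance, features of the frozen backbone) and the loss would be computed on batched tensors in an autodiff framework. The pure-Python form here is meant only to make the objective concrete.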