AI Summary
Current text-to-image diffusion models rely on the CLIP text encoder, which struggles to disentangle semantic categories (e.g., "cat") from artistic styles (e.g., "Pokémon style"), leading to imprecise fine-grained style generation. To address this, we propose the first category-style disentangled CLIP fine-tuning framework. Our method leverages lightweight style-annotated data and a redesigned cross-modal cross-attention mechanism to enable complementary learning of category and style representations. Crucially, it requires no architectural modification to the underlying diffusion model and supports plug-and-play integration for precise style control. Experiments across diverse stylistic domains demonstrate significant improvements over baselines: our approach enhances style fidelity, controllability, and generalization while preserving generation diversity and image quality.
Abstract
Text-to-image diffusion models have shown remarkable capabilities in generating high-quality images closely aligned with textual inputs. However, the effectiveness of text guidance relies heavily on the CLIP text encoder, which is trained to attend to general content but struggles to capture semantics in specific domains such as styles. As a result, generation models tend to fail on prompts like "a photo of a cat in Pokémon style", simply producing images depicting "a photo of a cat". To fill this gap, we propose Control-CLIP, a novel decoupled CLIP fine-tuning framework that enables the CLIP model to learn the meanings of category and style in a complementary manner. With specially designed fine-tuning tasks on minimal data and a modified cross-attention mechanism, Control-CLIP can precisely guide the diffusion model toward a specific domain. Moreover, the parameters of the diffusion model remain entirely unchanged, preserving the original generation performance and diversity. Experiments across multiple domains confirm the effectiveness of our approach, particularly highlighting its robust plug-and-play capability in generating content in various specific styles.
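To make the "modified cross-attention" idea concrete, the following is a minimal sketch of one way decoupled text conditioning could be wired: the image queries attend over the concatenation of separately encoded category-prompt and style-prompt token embeddings, so both signals steer generation while the diffusion backbone's weights stay untouched. This is an illustrative assumption, not the paper's exact formulation; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoupled_cross_attention(query, cat_emb, style_emb, w_q, w_k, w_v):
    """Sketch: image-latent queries attend jointly over category-prompt
    and style-prompt token embeddings (concatenated as one context),
    letting both signals guide generation without modifying the U-Net.

    query:     (N, d)  image-latent tokens
    cat_emb:   (Tc, d) tokens from the category prompt encoder
    style_emb: (Ts, d) tokens from the style prompt encoder
    """
    context = np.concatenate([cat_emb, style_emb], axis=0)  # (Tc+Ts, d)
    q = query @ w_q                                         # (N, d)
    k = context @ w_k                                       # (Tc+Ts, d)
    v = context @ w_v                                       # (Tc+Ts, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])                 # (N, Tc+Ts)
    return softmax(scores, axis=-1) @ v                     # (N, d)

# Illustrative usage with random embeddings and identity projections.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
cat_tokens = rng.standard_normal((5, 8))    # e.g. "a photo of a cat"
style_tokens = rng.standard_normal((3, 8))  # e.g. "Pokémon style"
w = np.eye(8)
out = decoupled_cross_attention(q, cat_tokens, style_tokens, w, w, w)
```

Because the style tokens enter only through the attention context, swapping in a different style encoder output is a plug-and-play change, consistent with the claim that the diffusion model's own parameters are never fine-tuned.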