🤖 AI Summary
This work addresses the unsupervised discovery of a shared visual concept from few-shot image collections, without relying on external guidance such as text prompts or spatial masks. We propose a contrastive inversion framework that jointly optimizes a contrastive learning objective between the target token and image-wise auxiliary text tokens to disentangle semantic features. Furthermore, we introduce disentangled cross-attention fine-tuning during diffusion model inversion to enable fine-grained concept separation. Our method effectively suppresses overfitting while preserving concept fidelity. Experiments demonstrate significant improvements over state-of-the-art approaches in both concept representation accuracy and image editing quality: generated results exhibit higher purity, consistency, and semantic coherence. This work establishes a novel paradigm for few-shot customized image generation.
📝 Abstract
The recent demand for customized image generation raises a need for techniques that effectively extract the common concept from small sets of images. Existing methods typically rely on additional guidance, such as text prompts or spatial masks, to capture the common target concept. Unfortunately, relying on manually provided guidance can lead to incomplete separation of auxiliary features, which degrades generation quality. In this paper, we propose Contrastive Inversion, a novel approach that identifies the common concept by comparing the input images without relying on additional information. We train the target token along with image-wise auxiliary text tokens via contrastive learning, which extracts the well-disentangled true semantics of the target. We then apply disentangled cross-attention fine-tuning to improve concept fidelity without overfitting. Experimental results and analysis demonstrate that our method achieves balanced, high-level performance in both concept representation and editing, outperforming existing techniques.
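To make the token-level contrastive objective concrete, the sketch below shows an InfoNCE-style loss of the kind commonly used for such training: an anchor embedding (the shared target token) is pulled toward a positive feature and pushed away from negatives (e.g. the image-wise auxiliary token features). This is a minimal illustration under assumed names, not the paper's actual implementation, and the `info_nce` function and its arguments are hypothetical.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch, not the paper's code).

    anchor:    (d,) embedding of the shared target token
    positive:  (d,) feature the anchor should match
    negatives: (k, d) features the anchor should be pushed away from,
               e.g. image-wise auxiliary token embeddings
    tau:       softmax temperature
    """
    def cos(a, b):
        # cosine similarity with a small epsilon for numerical safety
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # positive logit first, then one logit per negative
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # cross-entropy with the positive at index 0
```

Minimizing this loss over the token embeddings encourages the target token to encode only the semantics shared across the input images, while per-image auxiliary tokens absorb the image-specific features.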