🤖 AI Summary
This work addresses the unsupervised discovery of a shared visual concept from few-shot image collections, without relying on external guidance such as text prompts or spatial masks. We propose a contrastive inversion framework that jointly optimizes a contrastive learning objective between the target token and image-wise auxiliary text tokens to disentangle semantic features. Furthermore, we introduce disentangled cross-attention fine-tuning during diffusion model inversion to enable fine-grained concept separation. Our method effectively suppresses overfitting while preserving concept fidelity. Experiments demonstrate significant improvements over state-of-the-art approaches in both concept representation accuracy and image editing quality: generated results exhibit higher purity, consistency, and semantic coherence. This work establishes a novel paradigm for few-shot customized image generation.
📝 Abstract
The recent demand for customized image generation raises a need for techniques that effectively extract the common concept from small sets of images. Existing methods typically rely on additional guidance, such as text prompts or spatial masks, to capture the common target concept. Unfortunately, relying on manually provided guidance can lead to incomplete separation of auxiliary features, which degrades generation quality. In this paper, we propose Contrastive Inversion, a novel approach that identifies the common concept by comparing the input images without relying on additional information. We train the target token along with image-wise auxiliary text tokens via contrastive learning, which extracts the well-disentangled true semantics of the target. We then apply disentangled cross-attention fine-tuning to improve concept fidelity without overfitting. Experimental results and analysis demonstrate that our method achieves balanced, high-level performance in both concept representation and editing, outperforming existing techniques.
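To make the token-level contrastive objective concrete, the sketch below shows an InfoNCE-style loss of the kind commonly used for such training: an anchor embedding (the shared target token) is pulled toward a positive feature and pushed away from negatives (e.g. the image-wise auxiliary token features). This is a minimal illustration under assumed names, not the paper's actual implementation, and the `info_nce` function and its arguments are hypothetical.

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE-style contrastive loss (illustrative sketch, not the paper's code).

    anchor:    (d,) embedding of the shared target token
    positive:  (d,) feature the anchor should match
    negatives: (k, d) features the anchor should be pushed away from,
               e.g. image-wise auxiliary token embeddings
    tau:       softmax temperature
    """
    def cos(a, b):
        # cosine similarity with a small epsilon for numerical safety
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    # positive logit first, then one logit per negative
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()  # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])  # cross-entropy with the positive at index 0
```

Minimizing this loss over the token embeddings encourages the target token to encode only the semantics shared across the input images, while per-image auxiliary tokens absorb the image-specific features.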