🤖 AI Summary
In text-to-image generation, keeping a subject consistent (e.g., a facial identity) across multiple independent generations remains challenging; existing approaches rely on fine-tuning, reference images, or access to other generations' latents, which limits their generality. This paper proposes Contrastive Concept Instantiation (CoCoIns), a framework that pairs a generative model with a lightweight mapping network that transforms input latent codes into pseudo-words tied to specific instances of a concept; reusing the same latent code then reproduces the same subject. The association is learned contrastively, by training the network to differentiate combinations of prompts and latent codes. CoCoIns thus achieves consistent cross-generation subject reproduction without fine-tuning, reference images, or auxiliary inputs. On single-subject face generation it performs comparably to existing methods while remaining more flexible, and it shows promise for extension to multiple subjects and other object categories.
📝 Abstract
While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple creations limits their application to long-form content generation. Existing approaches require time-consuming tuning, references for all subjects, or access to other creations. We introduce Contrastive Concept Instantiation (CoCoIns) to effectively synthesize consistent subjects across multiple independent creations. The framework consists of a generative model and a mapping network, which transforms input latent codes into pseudo-words associated with certain instances of concepts. Users can generate consistent subjects with the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to differentiate combinations of prompts and latent codes. Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining higher flexibility. We also demonstrate the potential of extending CoCoIns to multiple subjects and other object categories.
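The core idea — a mapping network that turns a latent code into a pseudo-word embedding, trained contrastively so that reusing a latent code reproduces the same instance while fresh codes yield distinct ones — can be illustrated with a toy sketch. Note this is a minimal illustration under assumed details: the abstract does not specify the network architecture or the exact loss, so the linear map, the InfoNCE-style objective, and all dimensions below are placeholders, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the paper's mapping network: a fixed random linear
# projection from a latent code to a pseudo-word embedding.
W = rng.normal(size=(8, 16))  # latent dim 8 -> embedding dim 16 (assumed)

def map_to_pseudo_word(z):
    """Map a latent code to an L2-normalized pseudo-word embedding."""
    e = z @ W
    return e / np.linalg.norm(e)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss: high similarity to the positive
    and low similarity to negatives gives a small loss."""
    sims = np.array([anchor @ positive] + [anchor @ n for n in negatives]) / tau
    sims -= sims.max()  # numerical stability before exponentiating
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

# Reusing the same latent code is the positive pair (same instance);
# independently drawn latent codes act as negatives (other instances).
z = rng.normal(size=8)
anchor = map_to_pseudo_word(z)
positive = map_to_pseudo_word(z)  # same code -> identical pseudo-word
negatives = [map_to_pseudo_word(rng.normal(size=8)) for _ in range(4)]

loss_same = info_nce(anchor, positive, negatives)
loss_diff = info_nce(anchor, negatives[0], negatives[1:])
assert loss_same < loss_diff  # reused code binds tighter than a fresh one
```

Training the mapping network against such an objective is what would push embeddings of the same prompt-latent combination together and embeddings of different combinations apart; here the network is frozen, so the example only demonstrates the loss geometry, not the learning step.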