🤖 AI Summary
In text-to-image generation, keeping a subject consistent (e.g., a facial identity) across multiple independent generations remains challenging; existing approaches rely on fine-tuning, reference images, or access to other generations' latents, which limits their generality. This paper proposes Contrastive Concept Instantiation (CoCoIns), a framework that pairs a generative model with a lightweight mapping network that transforms input latent codes into pseudo-words tied to specific instances of a concept; reusing the same latent code then reproduces the same subject. The association is learned contrastively, by training the network to differentiate combinations of prompts and latent codes. CoCoIns thus achieves consistent cross-generation subject reproduction without fine-tuning, reference images, or auxiliary inputs. On single-subject face generation it performs comparably to existing methods while remaining more flexible, and it shows promise for extension to multiple subjects and other object categories.
📝 Abstract
While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple creations limits their application to long-form content generation. Existing approaches require time-consuming tuning, references for all subjects, or access to other creations. We introduce Contrastive Concept Instantiation (CoCoIns) to effectively synthesize consistent subjects across multiple independent creations. The framework consists of a generative model and a mapping network, which transforms input latent codes into pseudo-words associated with certain instances of concepts. Users can generate consistent subjects with the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to differentiate combinations of prompts and latent codes. Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining higher flexibility. We also demonstrate the potential of extending CoCoIns to multiple subjects and other object categories.
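The core idea — a mapping network that turns a latent code into a pseudo-word embedding, trained contrastively so that reusing a latent code reproduces the same instance while fresh codes yield distinct ones — can be illustrated with a toy sketch. Note this is a minimal illustration under assumed details: the abstract does not specify the network architecture or the exact loss, so the linear map, the InfoNCE-style objective, and all dimensions below are placeholders, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the paper's mapping network: a fixed random linear
# projection from a latent code to a pseudo-word embedding.
W = rng.normal(size=(8, 16))  # latent dim 8 -> embedding dim 16 (assumed)

def map_to_pseudo_word(z):
    """Map a latent code to an L2-normalized pseudo-word embedding."""
    e = z @ W
    return e / np.linalg.norm(e)

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss: high similarity to the positive
    and low similarity to negatives gives a small loss."""
    sims = np.array([anchor @ positive] + [anchor @ n for n in negatives]) / tau
    sims -= sims.max()  # numerical stability before exponentiating
    return -np.log(np.exp(sims[0]) / np.exp(sims).sum())

# Reusing the same latent code is the positive pair (same instance);
# independently drawn latent codes act as negatives (other instances).
z = rng.normal(size=8)
anchor = map_to_pseudo_word(z)
positive = map_to_pseudo_word(z)  # same code -> identical pseudo-word
negatives = [map_to_pseudo_word(rng.normal(size=8)) for _ in range(4)]

loss_same = info_nce(anchor, positive, negatives)
loss_diff = info_nce(anchor, negatives[0], negatives[1:])
assert loss_same < loss_diff  # reused code binds tighter than a fresh one
```

Training the mapping network against such an objective is what would push embeddings of the same prompt-latent combination together and embeddings of different combinations apart; here the network is frozen, so the example only demonstrates the loss geometry, not the learning step.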