RealCustom++: Representing Images as Real-Word for Real-Time Customization

📅 2024-08-19
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image customization methods represent subjects as pseudo-words. Because pseudo-words conflict and entangle with the text prompt, these methods must trade subject fidelity against textual controllability, which leads to semantic inconsistencies. This work proposes a real-word representation paradigm that models customized subjects directly with natural-language vocabulary, removing the optimization conflict that pseudo-words introduce. The method uses a train-inference decoupled framework: a cross-layer, cross-scale projector and a curriculum training recipe learn the alignment between subject features and the real words in the text during training, and an adaptive mask guidance mechanism restricts customization to the target real word during inference. On open-domain benchmarks, the approach is reported to significantly improve the balance between subject similarity and text alignment, enabling high-quality, real-time customization.

📝 Abstract
Text-to-image customization, which takes given texts and images depicting given subjects as inputs, aims to synthesize new images that align with both text semantics and subject appearance. This task provides precise control over details that text alone cannot capture and is fundamental for various real-world applications, garnering significant interest from academia and industry. Existing works follow the pseudo-word paradigm, which represents given subjects as pseudo-words and combines them with given texts to collectively guide the generation. However, the inherent conflict and entanglement between the pseudo-words and texts result in a dual-optimum paradox, where subject similarity and text controllability cannot be optimal simultaneously. We propose a novel real-words paradigm termed RealCustom++ that instead represents subjects as non-conflicting real words, thereby disentangling subject similarity from text controllability and allowing both to be optimized simultaneously. Specifically, RealCustom++ introduces a novel "train-inference" decoupled framework: (1) During training, RealCustom++ learns the alignment between vision conditions and all real words in the text, ensuring high subject-similarity generation in open domains. This is achieved by a cross-layer cross-scale projector that robustly and finely extracts subject features, and a curriculum training recipe that adapts the generated subject to diverse poses and sizes. (2) During inference, leveraging the learned general alignment, an adaptive mask guidance is proposed to customize only the generation of the specific target real word, keeping other subject-irrelevant regions uncontaminated to ensure high text controllability in real time.
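The inference-time idea in step (2) can be illustrated with a toy sketch. The abstract does not say how the mask is computed, so everything below is an assumption made for illustration: the function names `top_fraction_mask` and `masked_blend`, the use of a per-position attention score for the target real word, and the top-fraction thresholding rule are all hypothetical and are not taken from RealCustom++ itself.

```python
from typing import List


def top_fraction_mask(attention: List[float], keep_fraction: float) -> List[int]:
    """Return a 0/1 mask marking the highest-scoring spatial positions.

    Assumption of this sketch: the target real word has one attention
    score per spatial position, and the mask keeps a fixed fraction of them.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(keep_fraction * len(attention)))
    # Rank positions by descending score; ties are broken by position index.
    ranked = sorted(range(len(attention)), key=lambda i: (-attention[i], i))
    kept = set(ranked[:n_keep])
    return [1 if i in kept else 0 for i in range(len(attention))]


def masked_blend(text_only: List[float], subject_cond: List[float],
                 mask: List[int]) -> List[float]:
    """Use subject-conditioned values inside the mask, text-only values outside."""
    if not (len(text_only) == len(subject_cond) == len(mask)):
        raise ValueError("all inputs must have the same length")
    return [s if m else t for t, s, m in zip(text_only, subject_cond, mask)]


# Toy example: six spatial positions, flattened into a list.
attention = [0.05, 0.40, 0.35, 0.10, 0.02, 0.08]  # made-up scores for the target word
mask = top_fraction_mask(attention, keep_fraction=0.34)
text_only = [0.0] * 6      # stand-in for features guided by text alone
subject_cond = [1.0] * 6   # stand-in for features guided by the subject image
blended = masked_blend(text_only, subject_cond, mask)
print(mask)     # positions 1 and 2 are selected
print(blended)  # subject influence is confined to those positions
```

In this toy setup the subject image influences only the positions most associated with the target word, while every other position is driven by the text alone, which is the property the abstract describes as keeping subject-irrelevant regions uncontaminated.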
Problem

Research questions and friction points this paper is trying to address.

Text-to-Image Synthesis
Subject Consistency
Textual Control Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

RealCustom++
Real-Word Alignment
Selective Textual Control