TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

📅 2025-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image methods struggle to simultaneously disentangle and precisely control multiple personalized visual concepts, such as objects, materials, poses, and lighting, which limits both generation diversity and local-editing fidelity. To address this, we propose a multi-concept personalization framework built on the DiT architecture. We first observe word-level semantic locality in DiT's modulation space and accordingly design an optimization mechanism that learns a distinct direction in that space for each word. We further exploit the model's dual textual control path, in which text influences generation through both attention and shift/scale modulation, to achieve cross-image concept disentanglement and plug-and-play composition from as little as a single image per concept. Our method significantly outperforms state-of-the-art approaches on multi-concept compositional generation, enabling fine-grained, controllable synthesis of objects, accessories, materials, poses, and lighting, and demonstrates strong generalization and high editing fidelity.
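The core mechanism described above, learning one direction per word in the modulation space, can be sketched roughly as follows. This is a minimal illustrative PyTorch sketch, not the paper's code: the frozen linear `denoiser`, the toy dimensions, and the simple additive offset are all assumptions standing in for the pre-trained DiT and its shift/scale modulation layers.

```python
# Minimal sketch (PyTorch) of per-word direction optimization in a
# DiT-style modulation space. All names and dimensions are illustrative
# assumptions; the frozen `denoiser` stands in for the pre-trained DiT.
import torch

torch.manual_seed(0)
d_mod, d_img = 64, 128                      # toy modulation / image dims
words = ["dog", "hat"]                      # concepts in the input caption

# One learnable direction per word, applied as an offset in modulation space.
directions = torch.nn.ParameterDict(
    {w: torch.nn.Parameter(torch.zeros(d_mod)) for w in words}
)
optimizer = torch.optim.Adam(directions.parameters(), lr=1e-2)

# Frozen stand-in for the pre-trained model: maps (noisy image, modulation)
# to a denoising prediction. In the real setting this would be the DiT.
denoiser = torch.nn.Linear(d_img + d_mod, d_img)
for p in denoiser.parameters():
    p.requires_grad_(False)

base_mod = torch.randn(d_mod)               # base modulation from the prompt
target = torch.randn(d_img)                 # toy "concept image" signal

for step in range(200):
    noisy = target + 0.1 * torch.randn(d_img)
    # Offset the base modulation with the directions of the active words.
    mod = base_mod + sum(directions[w] for w in words)
    pred = denoiser(torch.cat([noisy, mod]))
    loss = torch.nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```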

📝 Abstract
We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/
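Plug-and-play composition then amounts to selecting learned directions from different source images and summing them into the modulation vector at generation time. The sketch below assumes a hypothetical `learned` dictionary produced by an optimization like the one above; the real method applies these offsets inside the DiT's modulation layers rather than on a standalone vector.

```python
# Hedged sketch of plug-and-play composition: directions learned from
# different images are summed into the base modulation vector before
# running the frozen model. `learned` and `compose` are illustrative
# names, not the paper's API.
import torch

def compose(base_mod: torch.Tensor, learned: dict, selection) -> torch.Tensor:
    """Add the learned direction of each selected (image, word) pair."""
    mod = base_mod.clone()
    for image_id, word in selection:
        mod = mod + learned[image_id][word]
    return mod

# Toy usage: combine the "dog" concept from image A with the "lighting"
# concept from image B (random placeholder directions).
d_mod = 64
learned = {
    "A": {"dog": torch.randn(d_mod)},
    "B": {"lighting": torch.randn(d_mod)},
}
mod = compose(torch.randn(d_mod), learned, [("A", "dog"), ("B", "lighting")])
```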
Problem

Research questions and friction points this paper is trying to address.

Text-to-Image Generation
Personalized Concept Editing
Complex Concept Control
Innovation

Methods, ideas, or system contributions that make the work stand out.

TokenVerse
Text-to-Image Generation
Personalized Image Adjustment