🤖 AI Summary
Visual concept composition faces challenges including inaccurate cross-modal (image/video) concept extraction, difficulty in disentangling compositional elements, and inflexible fusion mechanisms. To address these, we propose a hierarchical visual concept composition framework tailored for Diffusion Transformers. Our method introduces a Binder module that explicitly binds visual concepts to prompt tokens, incorporates a Diversify-and-Absorb mechanism to suppress concept-irrelevant details, and employs a Temporal Disentanglement strategy, implemented with a dual-branch binder, to decouple temporal dynamics in video generation. Crucially, the framework enables decomposition of complex concepts and cross-source (image/video) composition in a one-shot manner. Extensive evaluation demonstrates significant improvements over state-of-the-art methods in concept consistency, prompt fidelity, and motion quality, substantially enhancing both controllability and creativity in generative modeling.
📝 Abstract
Visual concept composition aims to integrate different elements from images and videos into a single, coherent visual output, yet existing methods still fall short in accurately extracting complex concepts from visual inputs and in flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts to their corresponding prompt tokens and composing the target prompt from bound tokens drawn from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers that encodes visual concepts into the corresponding prompt tokens, enabling accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
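The conditioning idea described above can be pictured with a toy sketch: diffusion latents cross-attend over concept-bound prompt tokens, with one extra absorbent slot appended so that attention mass belonging to concept-irrelevant details has somewhere to go instead of contaminating the concept tokens. This is a minimal numpy illustration under our own assumptions, not the paper's implementation; all names, shapes, and the random initialization are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product cross-attention: latents attend to prompt tokens.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_latents, n_tokens)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
d = 8
latents = rng.standard_normal((4, d))        # toy image/video latent queries
prompt_tokens = rng.standard_normal((3, d))  # concept-bound prompt tokens
absorbent = rng.standard_normal((1, d))      # extra (would-be learnable) absorbent token

# Condition on the prompt tokens plus the absorbent token: attention mass
# from concept-irrelevant details can flow to the absorbent slot rather
# than diluting the concept-token bindings.
kv = np.vstack([prompt_tokens, absorbent])
out, weights = cross_attention(latents, kv, kv)

print(weights.shape)  # (4, 4): each latent attends over 3 concept + 1 absorbent token
print(out.shape)      # (4, 8): conditioned latent features
```

In a real Diffusion Transformer the absorbent token would be a learned embedding trained jointly with the binder, and only the concept tokens would be carried forward when composing the target prompt.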