🤖 AI Summary
Visual concept composition faces challenges including inaccurate cross-modal (image/video) concept extraction, difficulty in disentangling compositional elements, and inflexible fusion mechanisms. To address these, we propose a hierarchical visual concept composition framework tailored for Diffusion Transformers. Our method introduces a Binder module that explicitly binds visual concepts to prompt tokens, incorporates a Diversify-and-Absorb mechanism to suppress concept-irrelevant details, and employs a Temporal Disentanglement strategy, implemented with a dual-branch binder, to decouple temporal dynamics in video generation. Crucially, the framework enables decomposition of complex concepts and cross-source (image/video) composition in a one-shot manner. Extensive evaluation demonstrates significant improvements over state-of-the-art methods in concept consistency, prompt fidelity, and motion quality, substantially enhancing both controllability and creativity in generative modeling.
📝 Abstract
Visual concept composition aims to integrate different elements from images and videos into a single, coherent visual output, yet existing methods still fall short in accurately extracting complex concepts from visual inputs and in flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts to their corresponding prompt tokens and composing the target prompt from bound tokens drawn from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers that encodes visual concepts into the corresponding prompt tokens, enabling accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality over existing approaches, opening up new possibilities for visual creativity.
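The conditioning idea described above can be pictured with a toy sketch: diffusion latents cross-attend over concept-bound prompt tokens, with one extra absorbent slot appended so that attention mass belonging to concept-irrelevant details has somewhere to go instead of contaminating the concept tokens. This is a minimal numpy illustration under our own assumptions, not the paper's implementation; all names, shapes, and the random initialization are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product cross-attention: latents attend to prompt tokens.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_latents, n_tokens)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
d = 8
latents = rng.standard_normal((4, d))        # toy image/video latent queries
prompt_tokens = rng.standard_normal((3, d))  # concept-bound prompt tokens
absorbent = rng.standard_normal((1, d))      # extra (would-be learnable) absorbent token

# Condition on the prompt tokens plus the absorbent token: attention mass
# from concept-irrelevant details can flow to the absorbent slot rather
# than diluting the concept-token bindings.
kv = np.vstack([prompt_tokens, absorbent])
out, weights = cross_attention(latents, kv, kv)

print(weights.shape)  # (4, 4): each latent attends over 3 concept + 1 absorbent token
print(out.shape)      # (4, 8): conditioned latent features
```

In a real Diffusion Transformer the absorbent token would be a learned embedding trained jointly with the binder, and only the concept tokens would be carried forward when composing the target prompt.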