🤖 AI Summary
Current video generation models struggle to maintain subject consistency under complex prompts, particularly when modeling spatial relationships among multiple subjects, temporal logic, and interactive behaviors. To address this, we propose BindWeave, a novel framework that, for the first time, achieves deep semantic-visual alignment between textual entities and visual subjects. BindWeave leverages a multimodal large language model (MLLM) for entity binding and role disentanglement, and couples it with a diffusion transformer (DiT) in a joint MLLM-DiT architecture. The MLLM produces subject-aware latent representations that condition the DiT, enabling cross-modal, controllable video synthesis. The method supports high-fidelity video generation across diverse scenarios, from single subjects to multi-subject interactions. Evaluated on the OpenS2V benchmark, BindWeave consistently outperforms existing open-source and commercial models, with significant gains in subject consistency, visual naturalness, and text-video alignment.
📝 Abstract
Diffusion Transformers have shown remarkable ability in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios, from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions. The resulting subject-aware hidden states condition the diffusion transformer, yielding high-fidelity, subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.
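To make the conditioning pathway concrete, below is a minimal, hypothetical sketch of how subject-aware hidden states from an MLLM could condition a DiT block via cross-attention. This is not the authors' implementation: the module names, dimensions, and the simple cross-attention fusion are illustrative assumptions, and details such as timestep modulation and the specific MLLM are omitted.

```python
# Illustrative sketch (assumed design, not the BindWeave code):
# a DiT-style block whose video latent tokens attend to subject-aware
# hidden states produced by an MLLM from the prompt + reference images.
import torch
import torch.nn as nn


class CrossAttnDiTBlock(nn.Module):
    """Self-attention over video latents, cross-attention to MLLM states, MLP."""

    def __init__(self, dim: int = 1024, cond_dim: int = 4096, heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Project MLLM hidden states into the DiT embedding space.
        self.cond_proj = nn.Linear(cond_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, subject_states: torch.Tensor) -> torch.Tensor:
        # x: (B, N_video_tokens, dim) noisy video latent tokens
        # subject_states: (B, N_cond_tokens, cond_dim) subject-aware MLLM states
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        cond = self.cond_proj(subject_states)
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond, need_weights=False)[0]
        x = x + self.mlp(self.norm3(x))
        return x


# Toy usage: 2 samples, 256 latent tokens, 77 conditioning tokens (all placeholders).
block = CrossAttnDiTBlock()
latents = torch.randn(2, 256, 1024)
mllm_states = torch.randn(2, 77, 4096)  # stand-in for MLLM hidden states
out = block(latents, mllm_states)
print(out.shape)  # torch.Size([2, 256, 1024])
```

The key idea this sketch illustrates is that the conditioning signal is the MLLM's contextualized hidden states, which already bind each textual entity to its reference image, rather than independent text and image embeddings.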