BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video generation models struggle to maintain subject consistency under complex prompts, particularly in modeling spatial relationships among multiple subjects, temporal logic, and interactive behaviors. To address this, we propose BindWeave—a novel framework that achieves deep semantic-visual alignment between textual entities and visual subjects for the first time. BindWeave leverages a multimodal large language model (MLLM) for entity binding and role disentanglement, and integrates it with a diffusion Transformer (DiT) to form an MLLM-DiT joint architecture. This enables generation of subject-aware latent representations, facilitating cross-modal controllable video synthesis. The method supports high-fidelity video generation across diverse scenarios—from single-subject to multi-subject interactions. Evaluated on the OpenS2V benchmark, BindWeave consistently outperforms existing open-source and commercial models, significantly improving subject consistency, visual naturalness, and text-video alignment.

📝 Abstract
Diffusion Transformer has shown remarkable abilities in generating high-fidelity videos, delivering visually coherent frames and rich details over extended durations. However, existing video generation models still fall short in subject-consistent video generation due to an inherent difficulty in parsing prompts that specify complex spatial relationships, temporal logic, and interactions among multiple subjects. To address this issue, we propose BindWeave, a unified framework that handles a broad range of subject-to-video scenarios from single-subject cases to complex multi-subject scenes with heterogeneous entities. To bind complex prompt semantics to concrete visual subjects, we introduce an MLLM-DiT framework in which a pretrained multimodal large language model performs deep cross-modal reasoning to ground entities and disentangle roles, attributes, and interactions, yielding subject-aware hidden states that condition the diffusion transformer for high-fidelity subject-consistent video generation. Experiments on the OpenS2V benchmark demonstrate that our method achieves superior performance across subject consistency, naturalness, and text relevance in generated videos, outperforming existing open-source and commercial models.
Problem

Research questions and friction points this paper is trying to address.

Addressing subject inconsistency in multi-subject video generation
Resolving complex spatial-temporal relationship parsing in prompts
Enabling cross-modal integration for entity-consistent video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLM-DiT framework enables cross-modal reasoning
Grounds entities and disentangles interactions via subject-aware hidden states
Diffusion transformer generates subject-consistent videos
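The paper states that subject-aware hidden states from the MLLM condition the diffusion transformer. A common mechanism for this kind of conditioning is cross-attention inside each DiT block, where the video latent tokens attend to the conditioning states. The sketch below illustrates that idea only; it is not the paper's actual implementation, and all shapes, weights, and the `cross_attention` helper are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, cond, Wq, Wk, Wv):
    """Condition video latent tokens on subject-aware hidden states.

    latents: (T, d) video latent tokens inside a DiT block (hypothetical shapes)
    cond:    (S, d) subject-aware hidden states from the MLLM
    """
    Q, K, V = latents @ Wq, cond @ Wk, cond @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (T, S) attention weights
    return latents + attn @ V  # residual update, as in standard transformer blocks

rng = np.random.default_rng(0)
d = 16
latents = rng.normal(size=(8, d))   # 8 video latent tokens
cond = rng.normal(size=(4, d))      # 4 subject-aware MLLM states
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out = cross_attention(latents, cond, Wq, Wk, Wv)
print(out.shape)  # (8, 16)
```

Each latent token mixes in information from the entity states it attends to, which is one plausible way the "entity binding" described above could steer generation toward subject-consistent frames.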