Mask$^2$DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address fine-grained text-video alignment and cross-segment visual inconsistency in multi-scene long-video generation, this paper proposes Mask$^2$DiT, a dual mask-based diffusion Transformer. Methodologically, it combines a symmetric binary attention mask with a segment-level conditional mask: the former enforces precise segment-wise text-to-video alignment within the DiT architecture, while the latter conditions each newly generated segment on the preceding segments, enabling auto-regressive scene extension with temporal coherence. The core contribution is the first joint modeling of segment-level semantic alignment and long-range temporal consistency within a diffusion Transformer framework. Experiments on multi-scene video generation show that the method improves both cross-segment visual consistency and semantic alignment accuracy, outperforming existing state-of-the-art approaches in both qualitative and quantitative evaluations.

📝 Abstract
Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To bridge this gap, we propose Mask$^2$DiT, a novel approach that establishes fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we introduce a symmetric binary mask at each attention layer within the DiT architecture, ensuring that each text annotation applies exclusively to its respective video segment while preserving temporal coherence across visual tokens. This attention mechanism enables precise segment-level textual-to-visual alignment, allowing the DiT architecture to effectively handle video generation tasks with a fixed number of scenes. To further equip the DiT architecture with the ability to generate additional scenes based on existing ones, we incorporate a segment-level conditional mask, which conditions each newly generated segment on the preceding video segments, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that Mask$^2$DiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description. Our project page is https://tianhao-qi.github.io/Mask2DiTProject.
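The symmetric binary mask described in the abstract (each text annotation attends only to its own video segment, while visual tokens attend to all visual tokens for temporal coherence) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the token layout, segment sizes, and the function name `build_segment_mask` are all assumptions.

```python
import numpy as np

def build_segment_mask(n_segments: int, text_len: int, video_len: int) -> np.ndarray:
    """Illustrative symmetric binary attention mask (True = attention allowed).

    Assumed token layout: all text tokens first (text_len per segment),
    then all visual tokens (video_len per segment).
    """
    T = n_segments * text_len    # total text tokens
    V = n_segments * video_len   # total visual tokens
    mask = np.zeros((T + V, T + V), dtype=bool)
    # Visual-to-visual attention is fully allowed, preserving
    # temporal coherence across segments.
    mask[T:, T:] = True
    for i in range(n_segments):
        t0, t1 = i * text_len, (i + 1) * text_len
        v0, v1 = T + i * video_len, T + (i + 1) * video_len
        mask[t0:t1, t0:t1] = True  # text self-attention within its segment
        mask[t0:t1, v0:v1] = True  # text of segment i -> video of segment i
        mask[v0:v1, t0:t1] = True  # video of segment i -> text of segment i (symmetric)
    return mask
```

In a DiT attention layer, such a boolean mask would be passed to the attention computation so that disallowed text-video pairs receive zero attention weight, giving the one-to-one segment alignment the paper describes.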
Problem

Research questions and friction points this paper is trying to address.

Multi-scene, long-video generation with fine-grained text alignment
Segment-level textual-to-visual coherence across scenes
Auto-regressive scene extension toward longer videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual mask-based attention for segment alignment
Segment-level conditional mask for auto-regressive extension
Fine-grained text-video alignment in DiT architecture
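The segment-level conditional mask for auto-regressive extension can be sketched in a few lines: preceding segments are kept as clean conditioning latents, and only the newly appended segment is denoised. This is a hedged sketch under assumed conventions; the function name `conditional_noise_mask` and the per-segment token count are illustrative, not from the paper.

```python
import numpy as np

def conditional_noise_mask(n_segments: int, tokens_per_segment: int,
                           n_condition: int) -> np.ndarray:
    """Illustrative segment-level conditional mask.

    The first n_condition segments act as clean conditioning context
    (mask = 0) and are not denoised; the remaining segments (mask = 1)
    are generated conditioned on them.
    """
    mask = np.ones(n_segments * tokens_per_segment, dtype=np.float32)
    mask[: n_condition * tokens_per_segment] = 0.0
    return mask

# Assumed usage during sampling: blend clean context with noisy latents, e.g.
#   x_t = mask * noisy_latents + (1 - mask) * clean_latents
# so denoising only updates the newly generated segment.
```

Repeating this step, each pass extends the video by one scene conditioned on all previously generated segments.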
Tianhao Qi
PhD, University of Science and Technology of China
cross-modal generation, object detection
Jianlong Yuan
Bytedance Intelligent Creation
Wanquan Feng
USTC
computer vision
Shancheng Fang
USTC
Jiawei Liu
Bytedance Intelligent Creation
SiYu Zhou
Bytedance Intelligent Creation
Qian He
ByteDance
Hongtao Xie
University of Science and Technology of China
Yongdong Zhang
University of Science and Technology of China