🤖 AI Summary
This work addresses a trade-off in multi-event text-to-video generation: high action fidelity and temporal consistency are difficult to achieve simultaneously. The authors attribute this to two causes: temporal misalignment between the prompt and the video content, and conflicting attention coupling between motion-related visual objects and their associated text conditions. To resolve this, they propose Temporal-wise Separable Attention (TS-Attn), a novel plug-and-play, training-free attention mechanism. By dynamically rearranging the attention distribution and combining a temporal alignment strategy with decoupled text-visual attention, TS-Attn improves temporal awareness and global coherence without altering the pre-trained model architecture. On Wan2.1-T2V-14B and Wan2.2-T2V-A14B, the method improves StoryEval-Bench scores by 33.5% and 16.4%, respectively, with only a 2% increase in inference time, and it also applies to multi-event image-to-video generation.
📝 Abstract
Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at https://github.com/Hong-yu-Zhang/TS-Attn.
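The abstract does not give the exact formulation of TS-Attn, so the following is only an illustrative sketch of the general idea of making cross-attention time-aware, not the authors' method. It assumes each frame and each text token carries a hypothetical event index, and it adds a fixed bias to the attention logits so that frames favor the tokens of their own event. The function name `temporally_biased_weights`, the `bias` parameter, and the convention that a negative token index marks a shared (global) token are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def temporally_biased_weights(scores, frame_event, token_event, bias=2.0):
    """Re-weight frame-to-token attention logits by event membership.

    scores[f][t]   : raw attention logit of frame f for text token t
    frame_event[f] : event index assigned to frame f (hypothetical labeling)
    token_event[t] : event index of token t, or a negative value for
                     tokens shared by all events (hypothetical convention)
    bias           : logit bonus for matching events, penalty otherwise
    """
    weights = []
    for f, row in enumerate(scores):
        shifted = []
        for t, logit in enumerate(row):
            if token_event[t] < 0:
                shifted.append(logit)          # shared token: unchanged
            elif token_event[t] == frame_event[f]:
                shifted.append(logit + bias)   # token of this frame's event
            else:
                shifted.append(logit - bias)   # token of another event
        weights.append(softmax(shifted))
    return weights


# Two frames, three tokens: token 0 describes event 0, token 1 describes
# event 1, and token 2 is shared context (for example, the subject).
scores = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
frame_event = [0, 1]
token_event = [0, 1, -1]
weights = temporally_biased_weights(scores, frame_event, token_event)
```

With uniform raw logits, each frame ends up attending most to its own event's token, then to the shared token, and least to the other event's token. A training-free method in this spirit only changes how existing attention logits are weighted at inference time, which is consistent with the small overhead reported above.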