TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses multi-event text-to-video generation, where high action fidelity and temporal consistency are difficult to achieve simultaneously. The authors trace the difficulty to two causes: temporal misalignment between the prompt and the video content, and attention conflicts between moving objects and their text conditions. To resolve this, they propose TS-Attn, a plug-and-play, training-free attention mechanism that introduces temporally separable attention for the first time. By dynamically reshaping attention distributions and combining a temporal alignment strategy with decoupled text-visual attention, TS-Attn improves temporal awareness and global coherence without altering the pre-trained model architecture. On Wan2.1-T2V-14B and Wan2.2-T2V-A14B, the method improves StoryEval-Bench scores by 33.5% and 16.4%, respectively, with only a 2% increase in inference time, and it also applies to image-to-video generation.

📝 Abstract

Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at https://github.com/Hong-yu-Zhang/TS-Attn.
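
To make the idea of temporally aware attention reweighting more concrete, here is a minimal toy sketch in plain Python. It is not the authors' TS-Attn implementation, which the abstract does not spell out. It assumes, purely for illustration, that frames are split evenly across events and that a fixed additive bias is applied to text cross-attention logits before the softmax. The function names (`assign_frames_to_events`, `temporal_bias`, `reweight_attention`), the `boost` parameter, and the token spans are all hypothetical.

```python
import math

def assign_frames_to_events(num_frames, num_events):
    """Assumption: each event gets a contiguous, near-equal share of frames."""
    return [min(f * num_events // num_frames, num_events - 1)
            for f in range(num_frames)]

def temporal_bias(num_frames, num_tokens, event_spans, boost=2.0):
    """Build a [frames x tokens] additive bias for cross-attention logits.

    event_spans holds one (start, end) token range per event, end exclusive.
    Tokens of the event active at a frame get +boost, tokens of the other
    events get -boost, and tokens outside every span stay at 0.
    """
    frame_event = assign_frames_to_events(num_frames, len(event_spans))
    bias = [[0.0] * num_tokens for _ in range(num_frames)]
    for f, active in enumerate(frame_event):
        for e, (start, end) in enumerate(event_spans):
            delta = boost if e == active else -boost
            for t in range(start, end):
                bias[f][t] = delta
    return bias

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def reweight_attention(logits, bias):
    """Add the bias to the logits, then normalize each frame's row."""
    return [softmax([l + b for l, b in zip(logit_row, bias_row)])
            for logit_row, bias_row in zip(logits, bias)]

# Toy setup: 4 frames, 6 prompt tokens, two events covering tokens 1-2 and 3-4.
num_frames, num_tokens = 4, 6
event_spans = [(1, 3), (3, 5)]
logits = [[0.0] * num_tokens for _ in range(num_frames)]  # uniform baseline
attn = reweight_attention(
    logits, temporal_bias(num_frames, num_tokens, event_spans))
```

In this toy setup, early frames place more attention mass on the first event's tokens and late frames on the second event's, while tokens outside any event span are treated identically in every frame. A real integration would operate on the cross-attention tensors inside the video model and would likely derive the frame-to-event assignment from the prompt rather than from an even split.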

Problem

Research questions and friction points this paper is trying to address.

multi-event video generation

temporal consistency

prompt-following capability

text-to-video synthesis

temporal alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal-wise Separable Attention

multi-event video generation

attention mechanism