TIMERIPPLE: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video diffusion Transformers (vDiTs) suffer from high inference latency due to redundant self-attention computations, and prevailing sparsification methods, borrowed directly from language models, overlook the spatio-temporal correlations inherent in video data. To address this, the paper proposes a lightweight, adaptive channel-wise attention score reuse strategy: by modeling token-level spatio-temporal dependencies in the latent space, the method selectively reuses partial attention scores from spatially or temporally adjacent tokens, avoiding full attention computation. Crucially, it requires no predefined sparsity pattern, ensuring both generality and efficiency. Evaluated on four vDiT architectures, the approach achieves up to 85% FLOPs reduction while incurring less than 0.06% quality loss on VBench, a substantial improvement over existing acceleration techniques.

📝 Abstract
The recent surge in video generation reflects the growing demand for high-quality video synthesis using large vision models. Existing video generation models are predominantly based on the video diffusion transformer (vDiT); however, they suffer from substantial inference delays due to self-attention. While prior studies have focused on reducing redundant computations in self-attention, they often overlook the inherent spatio-temporal correlations in video streams and directly borrow sparsity patterns from large language models to reduce attention computations. In this work, we take a principled approach to accelerating self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiT arise primarily from dominant spatial and temporal correlations at the token channel level. Based on this insight, we propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels. We demonstrate that our method achieves significantly higher computational savings (85%) than state-of-the-art techniques across 4 vDiTs, while preserving almost identical video quality ($<$0.06% loss on VBench).
Problem

Research questions and friction points this paper is trying to address.

Accelerating video diffusion transformers by reducing self-attention computational overhead
Leveraging latent space spatio-temporal correlations to optimize attention mechanisms
Developing an adaptive reuse strategy for attention scores while preserving video quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging spatio-temporal correlations in latent space
Adaptive reuse strategy for attention computations
Reusing partial scores of correlated tokens
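The reuse idea above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the per-channel partial-score cache, the drift threshold `tau`, and the function names are all hypothetical, chosen only to show how cached channel-wise scores from a temporally adjacent token could replace full recomputation.

```python
import numpy as np

def channelwise_scores(Q, K):
    """Per-channel partial attention scores: P[c] = outer(Q[:, c], K[:, c]).

    Summing P over channels recovers the usual attention score matrix
    S = Q @ K.T / sqrt(d).
    """
    d = Q.shape[1]
    return np.einsum("nc,mc->cnm", Q, K) / np.sqrt(d)

def reuse_scores(Q_cur, K, P_prev, Q_prev, tau=0.95):
    """Approximate scores for the current tokens by reusing cached partial
    scores from temporally adjacent tokens (e.g., the previous frame).

    A (channel, token) row is recomputed only when the query channel value
    drifted by more than (1 - tau) relative to the cached token; otherwise
    the cached per-channel score row is reused as-is.
    """
    d = Q_cur.shape[1]
    P = P_prev.copy()
    drift = np.abs(Q_cur - Q_prev) / (np.abs(Q_prev) + 1e-8)  # (n, d)
    recompute = drift.T > (1.0 - tau)                          # (d, n) mask
    c_idx, n_idx = np.nonzero(recompute)
    # Recompute only the flagged (channel, token) rows of partial scores.
    P[c_idx, n_idx, :] = Q_cur[n_idx, c_idx, None] * K[:, c_idx].T / np.sqrt(d)
    return P.sum(axis=0), recompute.mean()  # scores (n, m), recompute fraction
```

When tokens change little between adjacent frames, most rows pass the drift check and are reused, so only a small fraction of the partial scores is recomputed; wherever the drift is zero, the reused sum is exact.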
Wenxuan Miao
Shanghai Jiao Tong University
Yulin Sun
Shanghai Jiao Tong University
Aiyue Chen
Huawei Technologies Co., Ltd.
Jing Lin
Huawei Technologies Co., Ltd.
Yiwu Yao
Peking University
Artificial Intelligence
Yiming Gan
Institute of Computing Technology, Chinese Academy of Sciences
Jieru Zhao
Associate Professor, Shanghai Jiao Tong University
Hardware-software co-design, AI acceleration and system, Compiler, FPGA, High-level synthesis
Jingwen Leng
Professor, Shanghai Jiao Tong University
Computer Architecture
Mingyi Guo
Shanghai Jiao Tong University, Shanghai Qizhi Institute
Yu Feng
Shanghai Jiao Tong University, Shanghai Qizhi Institute