TIMERIPPLE: Accelerating vDiTs by Understanding the Spatio-Temporal Correlations in Latent Space

📅 2025-11-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video diffusion Transformers (vDiTs) suffer from high inference latency due to redundant self-attention computations, and prevailing sparsification methods, borrowed directly from language models, overlook the spatio-temporal correlations inherent in video data. To address this, the paper proposes a lightweight, adaptive channel-wise attention score reuse strategy: by modeling token-level spatio-temporal dependencies in the latent space, the method selectively reuses partial attention scores from spatially or temporally adjacent tokens, avoiding full attention computation. Crucially, it requires no predefined sparsity pattern, ensuring both generality and efficiency. Evaluated on four vDiT architectures, the approach achieves up to 85% FLOPs reduction while incurring less than 0.06% quality loss on VBench, a substantial improvement over existing acceleration techniques.

📝 Abstract
The recent surge in video generation reflects the growing demand for high-quality video synthesis using large vision models. Existing video generation models are predominantly based on the video diffusion transformer (vDiT); however, they suffer from substantial inference delays due to self-attention. While prior studies have focused on reducing redundant computations in self-attention, they often overlook the inherent spatio-temporal correlations in video streams and directly borrow sparsity patterns from large language models to reduce attention computations. In this work, we take a principled approach to accelerating self-attention in vDiTs by leveraging the spatio-temporal correlations in the latent space. We show that the attention patterns within vDiT arise primarily from dominant spatial and temporal correlations at the token channel level. Based on this insight, we propose a lightweight and adaptive reuse strategy that approximates attention computations by reusing partial attention scores of spatially or temporally correlated tokens along individual channels. We demonstrate that our method achieves significantly higher computational savings (85%) than state-of-the-art techniques across 4 vDiTs, while preserving almost identical video quality ($<$0.06% loss on VBench).
Problem

Research questions and friction points this paper is trying to address.

Accelerating video diffusion transformers by reducing self-attention computational overhead
Leveraging latent space spatio-temporal correlations to optimize attention mechanisms
Developing an adaptive reuse strategy for attention scores while preserving video quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging spatio-temporal correlations in latent space
Adaptive reuse strategy for attention computations
Reusing partial scores of correlated tokens
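The reuse idea above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the per-channel partial-score cache, the drift threshold `tau`, and the function names are all hypothetical, chosen only to show how cached channel-wise scores from a temporally adjacent token could replace full recomputation.

```python
import numpy as np

def channelwise_scores(Q, K):
    """Per-channel partial attention scores: P[c] = outer(Q[:, c], K[:, c]).

    Summing P over channels recovers the usual attention score matrix
    S = Q @ K.T / sqrt(d).
    """
    d = Q.shape[1]
    return np.einsum("nc,mc->cnm", Q, K) / np.sqrt(d)

def reuse_scores(Q_cur, K, P_prev, Q_prev, tau=0.95):
    """Approximate scores for the current tokens by reusing cached partial
    scores from temporally adjacent tokens (e.g., the previous frame).

    A (channel, token) row is recomputed only when the query channel value
    drifted by more than (1 - tau) relative to the cached token; otherwise
    the cached per-channel score row is reused as-is.
    """
    d = Q_cur.shape[1]
    P = P_prev.copy()
    drift = np.abs(Q_cur - Q_prev) / (np.abs(Q_prev) + 1e-8)  # (n, d)
    recompute = drift.T > (1.0 - tau)                          # (d, n) mask
    c_idx, n_idx = np.nonzero(recompute)
    # Recompute only the flagged (channel, token) rows of partial scores.
    P[c_idx, n_idx, :] = Q_cur[n_idx, c_idx, None] * K[:, c_idx].T / np.sqrt(d)
    return P.sum(axis=0), recompute.mean()  # scores (n, m), recompute fraction
```

When tokens change little between adjacent frames, most rows pass the drift check and are reused, so only a small fraction of the partial scores is recomputed; wherever the drift is zero, the reused sum is exact.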
Wenxuan Miao
Shanghai Jiao Tong University
Yulin Sun
Shanghai Jiao Tong University
Aiyue Chen
Huawei Technologies Co., Ltd.
Jing Lin
Huawei Technologies Co., Ltd.
Yiwu Yao
Peking University
Artificial Intelligence
Yiming Gan
Institute of Computing Technology, Chinese Academy of Sciences
Jieru Zhao
Associate Professor, Shanghai Jiao Tong University
Hardware-software co-design, AI acceleration and system, Compiler, FPGA, High-level synthesis
Jingwen Leng
Professor, Shanghai Jiao Tong University
Computer Architecture
Mingyi Guo
Shanghai Jiao Tong University, Shanghai Qizhi Institute
Yu Feng
Shanghai Jiao Tong University, Shanghai Qizhi Institute