Motion-Aware Caching for Efficient Autoregressive Video Generation

📅 2026-05-03
📈 Citations: 0
Influential: 0

🤖 AI Summary
Autoregressive video generation remains costly because every frame requires iterative denoising, and existing cache-reuse methods skip computation at a coarse chunk level that cannot capture pixel-level motion dynamics. This work proposes MotionCache, a framework that the authors describe as the first to bring pixel-level motion awareness into the caching mechanism. Using inter-frame differences as a lightweight motion proxy, MotionCache follows a coarse-to-fine strategy: a warm-up phase first establishes semantic consistency, after which the cache update frequency of each token is scheduled dynamically according to local motion intensity. A theoretical analysis links cache-induced errors to residual instability. Experiments report speedups of 6.28× on SkyReels-V2 and 1.64× on MAGI-1, with VBench quality drops of only 1% and 0.01%, respectively.
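As a rough illustration of the "inter-frame differences as a motion proxy" idea, the sketch below pools absolute frame differences over non-overlapping patches to get one score per token. It is not taken from the MotionCache code: the function name `token_motion_scores`, the `patch` parameter, and the choice of mean absolute difference are all assumptions made for illustration.

```python
import numpy as np

def token_motion_scores(prev_frame, curr_frame, patch=2):
    """Hypothetical motion proxy: mean absolute inter-frame difference per patch.

    prev_frame, curr_frame: arrays of shape (H, W, C), with H and W divisible
    by `patch`. Returns an array of shape (H // patch, W // patch), one score
    per token, assuming each token covers a patch x patch pixel region.
    """
    prev = np.asarray(prev_frame, dtype=np.float64)
    curr = np.asarray(curr_frame, dtype=np.float64)
    # Per-pixel motion estimate, averaged over channels: shape (H, W).
    per_pixel = np.abs(curr - prev).mean(axis=-1)
    h, w = per_pixel.shape
    # Average each non-overlapping patch to get one score per token.
    return per_pixel.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))

# Usage: only the top-left 2x2 region changes between the two frames.
prev = np.zeros((4, 4, 3))
curr = np.zeros((4, 4, 3))
curr[0:2, 0:2, :] = 1.0
scores = token_motion_scores(prev, curr, patch=2)
```

In this toy case only the token covering the changed region receives a non-zero score, which is the signal a motion-aware cache could use to decide where recomputation matters most.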
📝 Abstract
Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of $\textbf{6.28}\times$ and $\textbf{1.64}\times$ respectively, while effectively preserving generation quality (VBench: $1\%\downarrow$ and $0.01\%\downarrow$ respectively). The code is available at https://github.com/ywlq/MotionCache.
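The warm-up phase followed by motion-weighted cache reuse can be pictured with a small scheduling sketch. This is a minimal illustration under stated assumptions, not the authors' algorithm: the linear mapping from motion score to refresh interval, the names `refresh_intervals`, `refresh_mask`, and `run_schedule`, and the parameters `min_interval`, `max_interval`, and `warmup_steps` are all hypothetical.

```python
import numpy as np

def refresh_intervals(motion, min_interval=1, max_interval=4):
    """Map per-token motion scores to refresh intervals in denoising steps.

    Assumed linear rule: the highest-motion token is refreshed every
    `min_interval` steps, a fully static token every `max_interval` steps.
    """
    motion = np.asarray(motion, dtype=np.float64)
    peak = motion.max()
    norm = motion / peak if peak > 0 else np.zeros_like(motion)
    intervals = np.rint(max_interval - norm * (max_interval - min_interval))
    return intervals.astype(int)

def refresh_mask(step, intervals, warmup_steps=2):
    """Return True where a token is recomputed at `step`, False where the cache is reused."""
    if step < warmup_steps:
        # Warm-up: recompute every token.
        return np.ones(intervals.shape, dtype=bool)
    return (step - warmup_steps) % intervals == 0

def run_schedule(motion, num_steps=10, warmup_steps=2):
    """Count how many times each token is recomputed over `num_steps` steps."""
    intervals = refresh_intervals(motion)
    computed = np.zeros(intervals.shape, dtype=int)
    for step in range(num_steps):
        computed += refresh_mask(step, intervals, warmup_steps)
    return computed

# Usage: one static token and one high-motion token.
counts = run_schedule([0.0, 1.0], num_steps=10, warmup_steps=2)
```

With these assumed settings the static token is recomputed 4 times out of 10 steps while the high-motion token is recomputed at every step, which mirrors the abstract's point that static regions tolerate aggressive skipping and moving regions do not.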
Problem

Research questions and friction points this paper is trying to address.

autoregressive video generation
motion-aware caching
cache reuse
pixel dynamics
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

motion-aware caching
autoregressive video generation
cache reuse
pixel-level motion
residual instability
Jing Xu
Unknown affiliation
Yuexiao Ma
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Songwei Liu
ByteDance
Xuzhe Zheng
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Shiwei Liu
ELLIS Institute Tübingen, MPI for Intelligent Systems, University of Oxford, CS@TU/e
Machine Learning, Deep Learning, Low-dimensional Learning, LLMs
Chenqian Yan
Xiamen University
Model Compression
Xiawu Zheng
Associate Professor, IEEE Senior Member, Xiamen University
Automated Machine Learning, Network Compression, Neural Architecture Search, AutoML
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Fei Chao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Xing Wang
ByteDance
image processing, deep learning, computer vision