🤖 AI Summary
This work addresses the high computational cost of multi-step denoising in autoregressive video diffusion models. Existing training-free acceleration methods are constrained by a binary choice between caching and recomputation, so they handle intermediate cases inefficiently, and under asynchronous scheduling they treat all effective frames uniformly, which introduces redundant computation. To overcome these limitations, we propose SCOPE, a training-free framework built on a tri-modal scheduling strategy—caching, prediction, and recomputation—augmented with selective computation. Our approach bridges the gap between caching and recomputation via Taylor extrapolation at the noise level, and ensures stability through an error propagation analysis. SCOPE is the first method to integrate tri-modal scheduling with an extrapolation-based prediction mechanism, achieving up to 4.73× acceleration on MAGI-1 and SkyReels-V2 while preserving original video quality and outperforming all existing training-free baselines.
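The prediction mode described above can be illustrated with a minimal sketch. The function below is a hypothetical first-order Taylor extrapolation along the denoising trajectory: given the two most recent cached denoiser outputs, it extrapolates the next one instead of running the full model. The function name and its use of uniform step spacing are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def taylor_predict(f_prev2: np.ndarray, f_prev1: np.ndarray) -> np.ndarray:
    """Predict the next denoiser output via first-order Taylor extrapolation.

    f_prev2, f_prev1: the two most recent cached model outputs, assumed to be
    taken at uniformly spaced noise levels. The finite difference
    (f_prev1 - f_prev2) estimates the derivative along the trajectory, and we
    extrapolate one step forward: f(t) ≈ f(t-1) + (f(t-1) - f(t-2)).
    """
    return f_prev1 + (f_prev1 - f_prev2)
```

For outputs that evolve roughly linearly across adjacent denoising steps, this prediction is nearly exact, which is why it can substitute for a full forward pass in the "intermediate" regime between caching and recomputation.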
📝 Abstract
Autoregressive (AR) video diffusion models enable long-form video generation but remain expensive due to repeated multi-step denoising. Existing training-free acceleration methods rely on binary cache-or-recompute decisions, overlooking intermediate cases where direct reuse is too coarse yet full recomputation is unnecessary. Moreover, asynchronous AR schedules assign different noise levels to co-generated frames, yet existing methods process the entire valid interval uniformly. To address these AR-specific inefficiencies, we present SCOPE, a training-free framework for efficient AR video diffusion. SCOPE introduces a tri-modal scheduler over cache, predict, and recompute, where prediction via noise-level Taylor extrapolation fills the gap between reuse and recomputation with explicit stability controls backed by error propagation analysis. It further introduces selective computation that restricts execution to the active frame interval. On MAGI-1 and SkyReels-V2, SCOPE achieves up to 4.73x speedup while maintaining quality comparable to the original output, outperforming all training-free baselines.
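The tri-modal scheduler and selective computation can be sketched as follows. This is a hypothetical illustration, assuming each frame carries an estimated change magnitude (`delta`) since its last full forward pass and that two thresholds separate the three modes; the function name, threshold scheme, and `skip` mode are assumptions, not the paper's exact policy.

```python
def schedule_step(num_frames, active_range, delta, cache_thresh, predict_thresh):
    """Assign one of three modes per frame: 'cache', 'predict', or 'recompute'.

    active_range: (lo, hi) half-open interval of frames currently being
    denoised; frames outside it are skipped (selective computation).
    delta[i]: estimated change magnitude for frame i since its last recompute.
    Requires cache_thresh < predict_thresh.
    """
    lo, hi = active_range
    modes = {}
    for i in range(num_frames):
        if not (lo <= i < hi):
            modes[i] = "skip"        # outside the active frame interval
        elif delta[i] < cache_thresh:
            modes[i] = "cache"       # negligible change: reuse cached output
        elif delta[i] < predict_thresh:
            modes[i] = "predict"     # intermediate case: Taylor extrapolation
        else:
            modes[i] = "recompute"   # large change: full denoiser forward pass
    return modes
```

The key design point is the middle branch: instead of forcing every frame into a binary cache-or-recompute decision, frames with moderate change get a cheap extrapolated prediction, and frames outside the active interval are never touched at all.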