🤖 AI Summary
Diffusion-based video generation suffers from high computational overhead and slow inference, especially for high-resolution, long-duration videos, and existing acceleration methods often compromise visual quality. To address this, we propose a Sketching-Rendering two-stage collaborative inference framework: a large DiT model handles the high-noise regime to ensure semantic consistency and motion fidelity, while a compact DiT model specializes in the low-noise regime to refine visual details. This paradigm introduces the first noise-stage decoupled scheduling strategy with heterogeneous DiT model specialization, and is orthogonal to and compatible with step-skipping techniques. On benchmarks including VBench, our method achieves near-lossless quality while accelerating inference by 3× over Wan and 2× over CogVideoX, significantly improving efficiency for long, high-definition video generation.
📝 Abstract
Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX, and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates inference by skipping computation, usually at the cost of severe quality degradation. In this paper, we propose SRDiffusion, a novel framework that leverages collaboration between large and small models to reduce inference cost. The large model handles high-noise steps to ensure semantic and motion fidelity (Sketching), while the smaller model refines visual details in low-noise steps (Rendering). Experimental results demonstrate that our method outperforms existing approaches, achieving over a 3$\times$ speedup for Wan with nearly no quality loss on VBench, and a 2$\times$ speedup for CogVideoX. Our method introduces a new direction orthogonal to existing acceleration strategies, offering a practical solution for scalable video generation.
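The Sketching-Rendering hand-off described above can be sketched as a minimal sampling loop. This is an illustrative sketch, not the paper's implementation: `large_model`, `small_model`, and `switch_step` are hypothetical names, and each model is assumed to be a callable that maps the current latents and timestep to updated latents.

```python
def sr_diffusion_sample(large_model, small_model, latents, timesteps, switch_step):
    """Two-stage collaborative sampling sketch (hypothetical interfaces).

    large_model / small_model: callables (latents, t) -> updated latents.
    switch_step: index at which denoising hands off from the large DiT
    (Sketching, high-noise steps) to the small DiT (Rendering, low-noise steps).
    """
    for i, t in enumerate(timesteps):
        # Early, high-noise steps: large DiT sketches semantics and motion.
        # Later, low-noise steps: small DiT renders fine visual details.
        model = large_model if i < switch_step else small_model
        latents = model(latents, t)
    return latents
```

The speedup comes from running the compact model for the remaining low-noise steps, where (per the paper's premise) fine-detail refinement no longer requires the large model's capacity.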