SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation

📅 2025-05-25
📈 Citations: 0
Influential: 0

🤖 AI Summary
Diffusion-based video generation is computationally expensive and slow at inference, especially for high-resolution, long-duration videos, and existing acceleration methods often compromise visual quality. To address this, the paper proposes SRDiffusion, a two-stage Sketching-Rendering cooperative inference framework: a large DiT model handles the high-noise steps to ensure semantic consistency and motion fidelity (Sketching), while a compact DiT model handles the low-noise steps to refine visual details (Rendering). The approach splits inference by noise stage and assigns each stage to a differently sized DiT model, which makes it orthogonal to, and compatible with, step-skipping techniques. On VBench, the method achieves over 3× speedup for Wan with nearly no quality loss, and 2× speedup for CogVideoX, improving efficiency for long, high-definition video generation.

📝 Abstract
Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates inference by skipping computation, usually at the cost of severe quality degradation. In this paper, we propose SRDiffusion, a novel framework that leverages collaboration between large and small models to reduce inference cost. The large model handles high-noise steps to ensure semantic and motion fidelity (Sketching), while the smaller model refines visual details in low-noise steps (Rendering). Experimental results demonstrate that our method outperforms existing approaches, achieving over 3$\times$ speedup for Wan with nearly no quality loss on VBench, and 2$\times$ speedup for CogVideoX. Our method is introduced as a new direction orthogonal to existing acceleration strategies, offering a practical solution for scalable video generation.
Problem

Research questions and friction points this paper is trying to address.

Accelerate video diffusion inference with minimal quality loss
Reduce computational cost for high-resolution, long-duration videos
Balance semantic fidelity and visual detail refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages large and small model collaboration
Sketching ensures semantic and motion fidelity
Rendering refines visual details efficiently
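
The Sketching/Rendering split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed `switch_index`, the plain Euler update, and the cost model are all assumptions made for illustration, and the page does not say how SRDiffusion actually chooses the switch point.

```python
from typing import Callable, List, Sequence, Tuple

Latent = List[float]
# Hypothetical denoiser signature: (latent, noise_level) -> predicted noise.
Denoiser = Callable[[Latent, float], Latent]


def cooperative_sample(
    latent: Latent,
    noise_levels: Sequence[float],
    large_model: Denoiser,
    small_model: Denoiser,
    switch_index: int,
) -> Tuple[Latent, List[str]]:
    """Denoise with the large model first (Sketching), then the small one (Rendering).

    noise_levels must be sorted from high noise to low noise.
    switch_index is an assumed fixed hand-off step, not the paper's rule.
    """
    models_used = []
    for i, sigma in enumerate(noise_levels):
        sigma_next = noise_levels[i + 1] if i + 1 < len(noise_levels) else 0.0
        if i < switch_index:
            model, name = large_model, "large"  # high-noise regime
        else:
            model, name = small_model, "small"  # low-noise regime
        predicted_noise = model(latent, sigma)
        # Toy Euler step; a real pipeline would use its scheduler's update.
        latent = [x + (sigma_next - sigma) * e
                  for x, e in zip(latent, predicted_noise)]
        models_used.append(name)
    return latent, models_used


def estimated_speedup(total_steps: int, switch_index: int,
                      small_to_large_cost: float) -> float:
    """Speedup over running the large model on every step.

    Assumes per-step cost is constant within each model and ignores
    any overhead from switching models.
    """
    if not 0 <= switch_index <= total_steps:
        raise ValueError("switch_index must lie in [0, total_steps]")
    cooperative_cost = switch_index + (total_steps - switch_index) * small_to_large_cost
    return total_steps / cooperative_cost
```

Under this simple cost model, the speedup grows as the hand-off happens earlier and as the small model gets cheaper relative to the large one. Calling `estimated_speedup` with hypothetical step counts and cost ratios gives a rough sense of the trade-off, but those inputs are not figures from the paper.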