SRDiffusion: Accelerate Video Diffusion Inference via Sketching-Rendering Cooperation

📅 2025-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion-based video generation suffers from high computational overhead and slow inference, especially for high-resolution, long-duration videos, and existing acceleration methods often compromise visual quality. To address this, we propose a two-stage Sketching-Rendering collaborative inference framework: a large DiT model handles the high-noise steps to ensure semantic consistency and motion fidelity, while a compact DiT model specializes in the low-noise steps to refine visual details. This paradigm introduces the first noise-stage decoupled scheduling strategy with heterogeneous, specialized DiT models, and it is orthogonal to and compatible with step-skipping techniques. On benchmarks including VBench, our method achieves near-lossless quality while accelerating inference by over 3× for Wan and 2× for CogVideoX, significantly improving efficiency for long, high-definition video generation.
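The split can be pictured as a single switch inside an otherwise standard denoising loop. The sketch below is a minimal illustration in a diffusers-style API, assuming hypothetical `large_dit` and `small_dit` denoisers with a shared call signature and a tunable `switch_ratio`; it is not the paper's released implementation.

```python
import torch

def sr_diffusion_sample(large_dit, small_dit, scheduler, latents,
                        text_emb, num_steps=50, switch_ratio=0.4):
    """Two-stage sketching-rendering denoising loop (illustrative sketch).

    large_dit / small_dit: hypothetical denoisers sharing one call signature.
    switch_ratio: fraction of early (high-noise) steps given to the large model.
    """
    scheduler.set_timesteps(num_steps)
    switch_step = int(num_steps * switch_ratio)
    for i, t in enumerate(scheduler.timesteps):
        # Sketching: the large DiT handles early, high-noise steps,
        # fixing global semantics and motion.
        # Rendering: the small DiT takes over late, low-noise steps,
        # refining visual detail at a fraction of the cost.
        model = large_dit if i < switch_step else small_dit
        with torch.no_grad():
            noise_pred = model(latents, t, text_emb)
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

Because the per-step interface is unchanged, step-skipping or caching methods could in principle be applied within either stage, which is what makes the approach orthogonal to existing acceleration strategies.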

📝 Abstract
Leveraging the diffusion transformer (DiT) architecture, models like Sora, CogVideoX and Wan have achieved remarkable progress in text-to-video, image-to-video, and video editing tasks. Despite these advances, diffusion-based video generation remains computationally intensive, especially for high-resolution, long-duration videos. Prior work accelerates its inference by skipping computation, usually at the cost of severe quality degradation. In this paper, we propose SRDiffusion, a novel framework that leverages collaboration between large and small models to reduce inference cost. The large model handles high-noise steps to ensure semantic and motion fidelity (Sketching), while the smaller model refines visual details in low-noise steps (Rendering). Experimental results demonstrate that our method outperforms existing approaches, achieving over 3$\times$ speedup for Wan with nearly no quality loss on VBench, and 2$\times$ speedup for CogVideoX. Our method is introduced as a new direction orthogonal to existing acceleration strategies, offering a practical solution for scalable video generation.
Problem

Research questions and friction points this paper is trying to address.

Accelerate video diffusion inference with minimal quality loss
Reduce computational cost for high-resolution, long-duration videos
Balance semantic fidelity and visual detail refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages large and small model collaboration
Sketching ensures semantic and motion fidelity
Rendering refines visual details efficiently
Authors

Shenggan Cheng (National University of Singapore; Machine Learning Systems, High Performance Computing, Deep Learning)
Yuanxin Wei (Sun Yat-sen University)
Lansong Diao (Alibaba Group)
Yong Liu (National University of Singapore)
Bujiao Chen (Alibaba Group)
Lianghua Huang (Tongyi Lab; generative modeling)
Yu Liu (Alibaba Group)
Wenyuan Yu (Alibaba Group; graph computation, data management, distributed systems and parallel computation)
Jiangsu Du (Sun Yat-sen University)
Wei Lin (Alibaba Group)
Yang You (Postdoc, Stanford University; 3D vision, computer graphics, computational geometry)