Beyond Few-Step Inference: Accelerating Video Diffusion Transformer Model Serving with Inter-Request Caching Reuse

📅 2026-04-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video diffusion Transformers suffer from high inference overhead due to iterative denoising, and existing caching methods—limited to exploiting intra-request similarity—show marginal gains under few-step distilled models. This work proposes Chorus, the first approach to introduce cross-request caching into video diffusion model serving. Chorus employs a three-stage caching strategy to reuse latent features from semantically similar requests and incorporates a token-guided attention mechanism to enhance semantic alignment, thereby improving cache applicability. It enables region-level intermediate-layer caching, extending full feature reuse to later denoising steps. Evaluated on an industrial-grade 4-step distilled DiT model, Chorus achieves up to 45% end-to-end speedup, substantially outperforming current single-request caching schemes.
📝 Abstract
Video Diffusion Transformer (DiT) models are a dominant approach for high-quality video generation but suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests to accelerate video diffusion model serving. Chorus achieves up to 45% speedup on industrial 4-step distilled models, where prior intra-request caching approaches are ineffective. Particularly, Chorus employs a three-stage caching strategy along the denoising process. Stage 1 performs full reuse of latent features from similar requests. Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps. This stage is combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts, thereby extending the applicability of full reuse to later denoising steps.
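The paper's implementation is not included on this page, but the core Stage-1 idea, reusing latent features across requests whose prompts are semantically similar, can be sketched as a similarity-keyed cache. Everything below (class name, the `sim_threshold` knob, the embedding-based lookup) is an illustrative assumption, not the authors' code:

```python
import numpy as np

class InterRequestLatentCache:
    """Minimal sketch of Stage-1 'full reuse' under assumed mechanics:
    latents produced for earlier requests are returned when a new
    prompt embedding is close enough in cosine similarity."""

    def __init__(self, sim_threshold=0.9):
        self.sim_threshold = sim_threshold  # assumed tunable knob
        self.entries = []                   # list of (prompt_emb, latent)

    def lookup(self, prompt_emb):
        """Return the cached latent of the most similar prompt, or None."""
        prompt_emb = np.asarray(prompt_emb, dtype=float)
        best_sim, best_latent = -1.0, None
        for emb, latent in self.entries:
            # cosine similarity between the two prompt embeddings
            sim = float(np.dot(emb, prompt_emb) /
                        (np.linalg.norm(emb) * np.linalg.norm(prompt_emb)))
            if sim > best_sim:
                best_sim, best_latent = sim, latent
        return best_latent if best_sim >= self.sim_threshold else None

    def insert(self, prompt_emb, latent):
        self.entries.append((np.asarray(prompt_emb, dtype=float),
                             np.asarray(latent)))
```

On a cache hit the server would skip (or shorten) denoising by starting from the reused latent; on a miss it would denoise normally and insert the result. How Chorus actually measures similarity and decides when full reuse is safe is not specified here.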
Problem

Research questions and friction points this paper is trying to address.

Video Diffusion Transformer
inference cost
inter-request caching
denoising steps
model serving
Innovation

Methods, ideas, or system contributions that make the work stand out.

inter-request caching
video diffusion transformer
denoising acceleration
latent feature reuse
Token-Guided Attention Amplification
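The keywords above mention Token-Guided Attention Amplification, which the abstract describes as improving semantic alignment between the generated video and the conditioning prompt. One plausible reading, purely an assumption on my part, is that attention logits toward selected prompt-token keys are scaled up before the softmax; the function name, `guided_idx`, and the `amp` factor below are all hypothetical:

```python
import numpy as np

def token_guided_attention(q, k, v, guided_idx, amp=1.5):
    """Hypothetical sketch: amplify attention logits for the key columns
    in guided_idx (assumed to index prompt tokens) before softmax,
    nudging generation toward the conditioning prompt. Not the paper's
    actual mechanism, just one way such amplification could work."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                  # (Lq, Lk) scaled dot-product
    logits[:, guided_idx] *= amp                   # boost guided token columns
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `amp > 1`, keys whose (positive) logits fall in `guided_idx` receive a larger share of the attention mass than under plain scaled dot-product attention.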
👥 Authors
Hao Liu (University of Electronic Science and Technology of China)
Ye Huang (Research Professor, East China Normal University)
Chenghuan Huang (Independent Researcher)
Zhenyi Zheng (Sun Yat-sen University, China)
Jiangsu Du (Sun Yat-sen University, China)
Ziyang Ma (Independent Researcher)
Jing Lyu (Shanghai Jiao Tong University)
Yutong Lu (Sun Yat-sen University, China)