🤖 AI Summary
Video diffusion Transformers suffer from high inference overhead due to iterative denoising, and existing caching methods, which are limited to exploiting intra-request similarity, yield only marginal gains on few-step distilled models. This work proposes Chorus, the first approach to bring cross-request caching to video diffusion model serving. Chorus employs a three-stage caching strategy that reuses latent features from semantically similar requests, and incorporates a Token-Guided Attention Amplification mechanism to strengthen semantic alignment and thereby widen cache applicability. It also enables region-level caching at intermediate denoising steps, extending full feature reuse to later steps. Evaluated on an industrial-grade 4-step distilled DiT model, Chorus achieves up to a 45% end-to-end speedup, substantially outperforming existing single-request caching schemes.
📝 Abstract
Video Diffusion Transformer (DiT) models are a dominant approach to high-quality video generation, but they suffer from high inference cost due to iterative denoising. Existing caching approaches primarily exploit similarity within the diffusion process of a single request to skip redundant denoising steps. In this paper, we introduce Chorus, a caching approach that leverages similarity across requests to accelerate video diffusion model serving. Chorus achieves up to a 45% speedup on industrial 4-step distilled models, where prior intra-request caching approaches are ineffective. Specifically, Chorus employs a three-stage caching strategy along the denoising process. Stage 1 performs full reuse of latent features from similar requests. Stage 2 exploits inter-request caching in specific latent regions during intermediate denoising steps. This stage is combined with Token-Guided Attention Amplification to improve semantic alignment between the generated video and the conditional prompts, thereby extending the applicability of full reuse to later denoising steps.
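The core idea of cross-request reuse can be sketched as a cache keyed by prompt similarity: an incoming request looks up latent features saved from an earlier, semantically similar request. The sketch below is illustrative only; the paper does not specify Chorus's cache design at this level of detail, and `embed_prompt`, the similarity threshold, and the toy character-count embedding are all stand-ins (a real system would use a text encoder and production-grade similarity search).

```python
import math

def embed_prompt(prompt: str) -> list[float]:
    # Toy bag-of-characters embedding, normalized to unit length.
    # Stand-in for a real text encoder; illustration only.
    vec = [0.0] * 26
    for ch in prompt.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Both inputs are unit vectors, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

class CrossRequestCache:
    """Hypothetical cross-request latent cache keyed by prompt similarity."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[list[float], object]] = []

    def lookup(self, prompt: str):
        # Return latents from the most similar prior request, or None
        # if no entry clears the similarity threshold (cache miss).
        query = embed_prompt(prompt)
        best, best_sim = None, self.threshold
        for emb, latents in self.entries:
            sim = cosine(query, emb)
            if sim >= best_sim:
                best, best_sim = latents, sim
        return best

    def store(self, prompt: str, latents) -> None:
        # Save this request's latents for reuse by future similar requests.
        self.entries.append((embed_prompt(prompt), latents))
```

On a cache hit, a serving system could skip early denoising steps entirely (full reuse, as in Stage 1) or reuse only selected latent regions (as in Stage 2); on a miss, it falls back to the full denoising process and populates the cache.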