MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video diffusion Transformer (DiT) caching methods operate at a single granularity, failing to simultaneously optimize generation quality and inference speed. This work proposes MixCache, a training-free caching framework that enables collaborative multi-granularity optimization. Its core contributions are threefold: (1) a context-aware cache triggering mechanism that decides when caching should be enabled; (2) an adaptive hybrid decision strategy that dynamically selects the optimal caching granularity (per-step, per-classifier-free-guidance (CFG) branch, or per-block) based on redundancy analysis; and (3) the first joint scheduling scheme for multi-granularity caches. Evaluated on Wan 14B and HunyuanVideo, MixCache achieves 1.94× and 1.97× inference speedups, respectively, while preserving or even improving generation fidelity. By decoupling granularity selection from fixed architectural assumptions and enabling fine-grained reuse, MixCache significantly improves both the efficiency and practical deployability of video DiT models.
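
The summary above does not include the authors' implementation; the following minimal Python sketch only illustrates the adaptive hybrid decision idea under assumed thresholds and metric choices (the names `STEP_T`, `CFG_T`, `BLOCK_T`, `relative_change`, and `choose_granularity` are hypothetical, not from the paper).

```python
# Minimal, hypothetical sketch of an adaptive hybrid cache decision (not the
# authors' code). Redundancy is estimated as the relative change of a feature
# between adjacent denoising steps; thresholds are illustrative only.
import torch

STEP_T, CFG_T, BLOCK_T = 0.02, 0.05, 0.10  # assumed thresholds, coarsest granularity first

def relative_change(prev: torch.Tensor, curr: torch.Tensor) -> float:
    """L1 relative change between a cached feature and its fresh recomputation."""
    return ((curr - prev).abs().mean() / (prev.abs().mean() + 1e-8)).item()

def choose_granularity(step_delta: float, cfg_delta: float, block_delta: float) -> str:
    """Return the coarsest caching granularity whose redundancy check passes."""
    if step_delta < STEP_T:    # the whole step output barely changed -> reuse the step
        return "step"
    if cfg_delta < CFG_T:      # CFG branches stayed close -> reuse one CFG branch
        return "cfg"
    if block_delta < BLOCK_T:  # only some transformer blocks are stable -> per-block reuse
        return "block"
    return "none"              # nothing is redundant enough; recompute everything
```

Preferring the coarsest granularity whose check passes is a natural design choice, since a step-level hit skips the most computation; the paper's actual scheduling is more involved than this sketch.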

📝 Abstract
Leveraging the Transformer architecture and the diffusion process, video DiT models have emerged as a dominant approach for high-quality video generation. However, their multi-step iterative denoising process incurs high computational cost and inference latency. Caching, a widely adopted optimization in DiT models, exploits the redundancy of the diffusion process to skip computations at different granularities (e.g., step, CFG, block). Nevertheless, existing caching methods are limited to single-granularity strategies and struggle to flexibly balance generation quality and inference speed. In this work, we propose MixCache, a training-free caching-based framework for efficient video DiT inference. It first distinguishes the interference and boundaries between different caching strategies, and then introduces a context-aware cache triggering strategy to determine when caching should be enabled, along with an adaptive hybrid cache decision strategy for dynamically selecting the optimal caching granularity. Extensive experiments on diverse models demonstrate that MixCache can significantly accelerate video generation (e.g., 1.94× speedup on Wan 14B, 1.97× speedup on HunyuanVideo) while delivering both superior generation quality and inference efficiency compared to baseline methods.
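
As a rough illustration of how a cache trigger of this kind could sit inside a denoising loop, here is a hedged sketch assuming a diffusers-style `model`/`scheduler` interface; the warm-up length, threshold, and reuse budget are invented parameters, not values from the paper.

```python
# Hedged sketch of cache triggering inside a denoising loop.
# `model` and `scheduler` follow an assumed diffusers-style interface; the
# warm-up length, threshold, and reuse budget are invented parameters.
def sample(model, scheduler, latents, text_emb,
           warmup_steps=5, trigger_thresh=0.03, max_reuse=2):
    cached_out, reuse_left = None, 0
    for i, t in enumerate(scheduler.timesteps):
        if reuse_left > 0 and cached_out is not None:
            out = cached_out                      # cache hit: skip the DiT forward pass
            reuse_left -= 1
        else:
            out = model(latents, t, text_emb)     # full forward pass
            if cached_out is not None and i >= warmup_steps:
                # enable caching only once consecutive outputs become redundant
                delta = ((out - cached_out).abs().mean()
                         / (cached_out.abs().mean() + 1e-8)).item()
                if delta < trigger_thresh:
                    reuse_left = max_reuse        # reuse this output for a few steps
            cached_out = out
        latents = scheduler.step(out, t, latents).prev_sample
    return latents
```

The paper's context-aware trigger is more elaborate than this sketch, which only compares consecutive model outputs at a single (step) granularity.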
Problem

Research questions and friction points this paper is trying to address.

High computational cost in video DiT models
Limited single-granularity caching strategies
Balancing generation quality and inference speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free caching framework for video DiT
Context-aware cache triggering strategy
Adaptive hybrid cache decision strategy (a per-CFG caching sketch follows this list)
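
To make the CFG granularity concrete, here is a speculative sketch of per-CFG caching in which the slowly changing unconditional branch is reused while the conditional branch is recomputed; the function name `cfg_forward`, the `state` dictionary, and the reuse flag are hypothetical, not the paper's API.

```python
# Speculative illustration of per-CFG caching: the unconditional branch often
# changes slowly across steps, so its output can be reused while the
# conditional branch is recomputed. All names here are hypothetical.
def cfg_forward(model, latents, t, cond_emb, uncond_emb, state,
                guidance=6.0, reuse_uncond=False):
    """One classifier-free-guidance step with optional unconditional-branch reuse."""
    cond = model(latents, t, cond_emb)            # always recompute the conditional branch
    if reuse_uncond and state.get("uncond") is not None:
        uncond = state["uncond"]                  # cache hit: skip the unconditional pass
    else:
        uncond = model(latents, t, uncond_emb)    # full unconditional pass
        state["uncond"] = uncond                  # refresh the per-CFG cache
    return uncond + guidance * (cond - uncond)    # standard CFG combination
```

Whether `reuse_uncond` is set for a given step would be decided by the triggering and hybrid decision logic sketched earlier.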
👥 Authors
Yuanxin Wei (Sun Yat-sen University)
Lansong Diao (Alibaba Group)
Bujiao Chen (Alibaba Group)
Shenggan Cheng (National University of Singapore): Machine Learning Systems, High Performance Computing, Deep Learning
Zhengping Qian (Alibaba Group, Microsoft Research): Distributed systems
Wenyuan Yu (Alibaba Group): Graph computation, data management, distributed systems and parallel computation
Nong Xiao (Sun Yat-sen University)
Wei Lin (Alibaba Group)
Jiangsu Du (Sun Yat-sen University)