QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation

📅 2025-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers (DiTs) achieve state-of-the-art performance in video generation but suffer from prohibitive computational and memory overhead, hindering on-device deployment. To address their low inference efficiency and high memory footprint, we propose a training-agnostic, end-to-end acceleration framework. Our method introduces three synergistic optimizations: (1) hierarchical latent variable caching that exploits inter-layer dependencies; (2) importance-adaptive quantization—gradient-free importance estimation coupled with dynamic bit-width allocation; and (3) DiT architecture redundancy-aware pruning. Unlike prior approaches, our framework requires no fine-tuning or retraining. Evaluated on Open-Sora, it achieves a 6.72× reduction in end-to-end latency while preserving near-lossless generation quality—outperforming existing quantization- and caching-based methods. The proposed technique overcomes the limitations of isolated acceleration strategies, enabling efficient, high-fidelity DiT inference on resource-constrained devices.
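To make optimization (2) concrete, here is a minimal sketch of gradient-free importance estimation with dynamic bit-width allocation. The importance proxy (mean absolute activation magnitude), the bit-width choices, and both function names are illustrative assumptions, not the paper's actual estimator or policy:

```python
import numpy as np

def allocate_bitwidths(activations, bit_options=(4, 6, 8)):
    """Assign a per-layer bit-width from gradient-free importance scores.

    Importance is approximated here by mean absolute activation magnitude;
    more important layers receive more bits. This is an illustrative
    stand-in for QuantCache's adaptive allocation, not its exact rule.
    """
    scores = np.array([np.abs(a).mean() for a in activations])
    order = np.argsort(scores)  # layer indices, ascending importance
    bits = np.empty(len(scores), dtype=int)
    # Split the ranked layers evenly across the available bit-widths.
    for b, idx in zip(sorted(bit_options), np.array_split(order, len(bit_options))):
        bits[idx] = b
    return bits

def quantize(x, bits):
    """Uniform symmetric quantization of x to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(x).max()
    scale = amax / qmax if amax > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale
```

The point of the sketch is the split between a cheap, training-free importance signal and the bit-width decision it drives; swapping in a different proxy leaves the allocation logic unchanged.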

📝 Abstract
Recently, Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation, surpassing U-Net-based models in terms of performance. However, the enhanced capabilities of DiTs come with significant drawbacks, including increased computational and memory costs, which hinder their deployment on resource-constrained devices. Current acceleration techniques, such as quantization and caching mechanisms, offer limited speedup and are often applied in isolation, failing to fully address the complexities of DiT architectures. In this paper, we propose QuantCache, a novel training-free inference acceleration framework that jointly optimizes hierarchical latent caching, adaptive importance-guided quantization, and structural redundancy-aware pruning. QuantCache achieves an end-to-end latency speedup of 6.72× on Open-Sora with minimal loss in generation quality. Extensive experiments across multiple video generation benchmarks demonstrate the effectiveness of our method, setting a new standard for efficient DiT inference. The code and models will be available at https://github.com/JunyiWuCode/QuantCache.
Problem

Research questions and friction points this paper is trying to address.

DiTs' high computational and memory costs hinder deployment on resource-constrained devices.
Existing acceleration techniques (quantization, caching) are applied in isolation and offer only limited speedup.
Accelerating DiT inference substantially without retraining or degrading video generation quality remains open.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical latent caching for efficient memory use
Adaptive importance-guided quantization for speedup
Structural redundancy-aware pruning to reduce complexity
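The latent-caching idea in the first bullet can be sketched as reusing a block's output across diffusion steps when its input has barely changed. The class name, the relative-change criterion, and the tolerance are assumptions for illustration; QuantCache's hierarchical policy over layers and timesteps is more elaborate:

```python
import numpy as np

class LatentCache:
    """Skip a transformer block when its input is close to the cached one.

    A simplified single-block sketch of step-to-step latent caching;
    the threshold `tol` controls the speed/quality trade-off.
    """

    def __init__(self, tol=1e-2):
        self.tol = tol
        self.prev_in = None
        self.prev_out = None

    def __call__(self, block, x):
        if self.prev_in is not None:
            # Relative change of the block input since the last computed step.
            delta = np.linalg.norm(x - self.prev_in) / (np.linalg.norm(self.prev_in) + 1e-12)
            if delta < self.tol:
                return self.prev_out  # cache hit: reuse, skip the block
        out = block(x)  # cache miss: recompute and refresh the cache
        self.prev_in, self.prev_out = x.copy(), out
        return out
```

Usage: wrap each DiT block in its own cache and call `cache(block, x)` at every denoising step; adjacent steps produce similar latents, so many block evaluations become cache hits.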