Accelerating Diffusion Transformers with Token-wise Feature Caching

📅 2024-10-05
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
To address the high inference overhead and coarse-grained caching strategies in diffusion Transformers (DiTs) for image/video generation, this paper proposes a token-level feature caching mechanism. First, it models token-wise cache sensitivity by analyzing attention patterns to dynamically assess each token’s robustness to caching-induced approximation. Based on this sensitivity, it generates fine-grained cache masks and applies hierarchical, adaptive cache ratio scheduling across transformer layers. The method requires no model retraining—offering a plug-and-play, architecture-agnostic solution compatible with mainstream DiT variants. Evaluated on OpenSora and PixArt-α, it achieves 2.36× and 1.93× inference speedup, respectively, with negligible degradation in fidelity metrics (e.g., FID, LPIPS). Key contributions include: (i) the first formulation of token-level cache sensitivity modeling in DiTs; (ii) a hierarchical, sensitivity-driven cache scheduling strategy; and (iii) a zero-training, deployment-ready caching paradigm.
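The token-selection step described above can be sketched as follows. This is a minimal, hypothetical illustration only: the function names, the attention-sum sensitivity score, and the 25% cache ratio are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def token_cache_mask(attn, cache_ratio=0.25):
    """Pick tokens whose features can be reused from cache.

    attn: (num_tokens, num_tokens) attention matrix from a previous timestep.
    Returns a boolean mask; True = reuse cached feature, False = recompute.
    """
    # Hypothetical sensitivity proxy: tokens that receive little total
    # attention are assumed more robust to caching-induced approximation.
    sensitivity = attn.sum(axis=0)           # attention each token receives
    num_cached = int(cache_ratio * len(sensitivity))
    order = np.argsort(sensitivity)          # least sensitive tokens first
    mask = np.zeros(len(sensitivity), dtype=bool)
    mask[order[:num_cached]] = True
    return mask

def layer_forward(x, cached, mask, compute_fn):
    """Serve masked tokens from cache; recompute only the rest."""
    out = cached.copy()
    out[~mask] = compute_fn(x[~mask])
    return out
```

With a cache ratio of 0.25 on 8 tokens, 2 tokens are served from cache and 6 are recomputed, which is where the inference savings come from: the expensive `compute_fn` (an attention or MLP sublayer) runs on fewer tokens per step.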

📝 Abstract
Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may lead to 10× more destruction to the overall generation quality compared with other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enable us to apply different caching ratios to neural layers in different types and depths. Extensive experiments on PixArt-α, OpenSora, and DiT demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36× and 1.93× acceleration are achieved on OpenSora and PixArt-α with almost no drop in generation quality.
Problem

Research questions and friction points this paper is trying to address.

Reduce computation costs in diffusion transformers.
Improve feature caching efficiency for tokens.
Enhance image and video generation quality.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-wise feature caching
Adaptive token selection
Layer-specific caching ratios
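The last innovation, applying different caching ratios by layer type and depth, could be organized as a simple per-layer schedule. The sketch below is purely illustrative: the base ratios, the choice to cache MLP sublayers more aggressively than attention, and the depth-growth factor are all assumptions, not values from the paper.

```python
def cache_ratio_schedule(num_layers, attn_ratio=0.3, mlp_ratio=0.5, depth_growth=0.5):
    """Return a per-layer dict of cache ratios.

    Hypothetical schedule: deeper layers get proportionally higher ratios
    (scaled up by depth_growth), and MLP sublayers are cached more than
    attention sublayers. All numbers are illustrative.
    """
    schedule = []
    for depth in range(num_layers):
        scale = 1.0 + depth_growth * (depth / max(num_layers - 1, 1))
        schedule.append({
            "attention": attn_ratio * scale,
            "mlp": mlp_ratio * scale,
        })
    return schedule
```

Such a schedule would then feed the per-layer ratio into the token-selection step, so that layers more tolerant of approximation reuse a larger fraction of cached tokens.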