🤖 AI Summary
Existing cache-acceleration methods for diffusion models rely on predefined heuristics or dataset-specific priors, suffering from poor generalizability and limited robustness to anomalous samples. To address this, we propose DiCache—a training-free, fully adaptive runtime caching strategy. We first identify a strong empirical correlation between shallow-layer feature discrepancies and final output degradation, enabling lightweight online probing to predict cache-induced error in real time. Leveraging this insight, DiCache dynamically determines *when* to cache and *how* to combine multi-step cached features, weighting them along the probe feature trajectory to better approximate the current feature. Crucially, DiCache introduces no additional training overhead and is plug-and-play compatible with mainstream diffusion models—including WAN 2.1, HunyuanVideo, and Flux—delivering substantial inference speedup while preserving or even enhancing the visual fidelity of the generated images and videos.
📝 Abstract
Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based methods. These studies seek to answer two fundamental questions: "when to cache" and "how to use the cache". They typically rely on predefined empirical laws or dataset-level priors to determine the timing of caching, and on handcrafted rules for leveraging multi-step caches. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail on outlier samples. In this paper, we reveal a strong correlation between the variation patterns of shallow-layer feature differences in the diffusion model and those of the final model outputs. Moreover, we observe that features from different model layers follow similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache consists of two principal components: (1) an Online Probe Profiling Scheme, which leverages a shallow-layer online probe to obtain a stable real-time prior on the caching error, enabling the model to autonomously determine its caching schedule; and (2) Dynamic Cache Trajectory Alignment, which combines multi-step caches along the shallow-layer probe feature trajectory to better approximate the current feature, yielding higher visual quality. Extensive experiments validate that DiCache achieves higher efficiency and improved visual fidelity over state-of-the-art methods on various leading diffusion models, including WAN 2.1 and HunyuanVideo for video generation and Flux for image generation.
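The two components described above can be illustrated with a minimal sketch. All names, the threshold `tau`, and the least-squares fusion rule below are illustrative assumptions, not the paper's actual formulation: the idea is that a cheap shallow-layer probe is compared across steps to decide whether cached deep features can be reused, and that multiple cached features are combined with weights fitted on the probe trajectory so their combination tracks the current step.

```python
import numpy as np

def probe_error(probe_prev, probe_curr, eps=1e-8):
    """Relative change of the shallow-layer probe feature between steps.

    Used as a cheap online proxy for the error that reusing the cache
    would introduce in the final output (hypothetical criterion).
    """
    return np.linalg.norm(probe_curr - probe_prev) / (np.linalg.norm(probe_prev) + eps)

def should_reuse_cache(probe_prev, probe_curr, tau=0.05):
    """Reuse cached deep features when the probe has barely moved.

    `tau` is an illustrative threshold; the actual schedule is decided
    adaptively at runtime in the paper's method.
    """
    return probe_error(probe_prev, probe_curr) < tau

def combine_cached_features(cached_feats, cached_probes, probe_curr):
    """Weight multi-step cached features along the probe trajectory.

    Fit weights w so that the same combination of cached *probe* features
    approximates the current probe (least squares), then apply those
    weights to the cached *deep* features. This exploits the observation
    that features from different layers follow similar trajectories.
    """
    P = np.stack(cached_probes)                 # (k, d) probe history
    w, *_ = np.linalg.lstsq(P.T, probe_curr, rcond=None)  # solve P^T w ≈ probe_curr
    F = np.stack(cached_feats)                  # (k, D) deep-feature history
    return w @ F                                # fused approximation of current feature
```

For example, if the current probe lies halfway between two cached probes, the fitted weights are roughly (0.5, 0.5) and the fused deep feature is the corresponding midpoint of the two cached features.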