🤖 AI Summary
Diffusion models commonly suffer from insufficient fidelity when sampling with a moderate number of function evaluations (NFEs, 20–50), while existing acceleration methods either target extremely low NFEs (<10) or rely on model-specific architectural assumptions, compromising generality and quality. To address this, we propose STORK, a training-free, architecture-agnostic ODE solver that, for the first time, integrates stiffness-aware ODE solving principles with an adaptive Taylor expansion to construct a stabilized orthogonal Runge–Kutta scheme. STORK is agnostic to model parametrization and supports both the noise-prediction and flow-matching paradigms without assuming a semi-linear structure. Evaluated on state-of-the-art models, including Stable Diffusion 3.5, SANA, and FLUX, STORK achieves substantial FID reductions and improved image fidelity across the 20–50 NFE regime. The implementation is publicly available.
📝 Abstract
Diffusion models (DMs) have demonstrated remarkable performance in high-fidelity image and video generation. Because high-quality generation with DMs typically requires a large number of function evaluations (NFEs), resulting in slow sampling, extensive research has successfully reduced the NFE to a small range (<10) while maintaining acceptable image quality. However, many practical applications, such as those involving Stable Diffusion 3.5, FLUX, and SANA, commonly operate in the mid-NFE regime (20–50 NFEs) to achieve superior results; despite this practical relevance, effective sampling within the mid-NFE regime remains underexplored. In this work, we propose a novel, training-free, and structure-independent DM ODE solver called the Stabilized Taylor Orthogonal Runge–Kutta (STORK) method, based on a class of stiff ODE solvers with a Taylor expansion adaptation. Unlike prior work such as DPM-Solver, which depends on the semi-linear structure of the DM ODE, STORK is applicable to any DM sampler, including noise-prediction and flow-matching models. Within the 20–50 NFE range, STORK achieves improved generation quality, as measured by FID scores, across unconditional pixel-level generation and conditional latent-space generation tasks using models like Stable Diffusion 3.5 and SANA. Code is available at https://github.com/ZT220501/STORK.
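As background, the abstract frames DM sampling as numerically integrating an ODE, where each evaluation of the drift (the trained network) costs one NFE. The toy sketch below is not the STORK method (whose scheme is not specified here); it integrates a simple linear ODE dx/dt = -x with a classical fourth-order Runge–Kutta step, illustrating the general point that higher-order solvers reach a target accuracy in far fewer steps than first-order Euler, which is why solver design matters in the mid-NFE regime.

```python
import math

def drift(t, x):
    # Toy stand-in for the DM ODE drift; a real sampler would call the
    # trained noise-prediction or flow-matching network here (1 NFE).
    return -x

def rk4_step(f, t, x, h):
    # Classical explicit 4th-order Runge-Kutta step: 4 NFEs per step.
    k1 = f(t, x)
    k2 = f(t + h / 2, x + h / 2 * k1)
    k3 = f(t + h / 2, x + h / 2 * k2)
    k4 = f(t + h, x + h * k3)
    return x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def solve_rk4(x0, t0, t1, n_steps):
    # Integrate dx/dt = drift(t, x) from t0 to t1 with fixed step size.
    h = (t1 - t0) / n_steps
    t, x = t0, x0
    for _ in range(n_steps):
        x = rk4_step(drift, t, x, h)
        t += h
    return x

# 10 RK4 steps (40 NFEs) already track the exact solution e^{-1}
# to roughly single-precision accuracy on this toy problem.
approx = solve_rk4(1.0, 0.0, 1.0, 10)
print(abs(approx - math.exp(-1.0)))
```

On this linear test problem the RK4 global error scales as O(h^4), so halving the step size cuts the error by about 16x; diffusion ODEs are harder (stiff, with an expensive network as the drift), which motivates the stiffness-aware design the abstract describes.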