🤖 AI Summary
Multimodal diffusion Transformers (DiTs) achieve state-of-the-art visual synthesis quality but suffer from prohibitive computational overhead, hindering practical deployment; existing sparsification methods rely on architecture-specific kernels, limiting generality. This work proposes a Unified Sparse Attention Engine that abstracts diverse sparsity patterns via a flexible symbolic representation, enabling a single kernel to support efficient inference across arbitrary DiT architectures. Integrated optimizations—including feature caching, block-sparse skip connections, and optimized sparse GEMM—significantly reduce redundant computation. Experiments demonstrate near-linear speedup for attention and GEMM-Q (≈1:1), 2.5×–3.8× acceleration for GEMM-O (reaching up to 87.5% of the theoretical peak), and 1.5× end-to-end inference speedup—all without compromising generation fidelity. The core contribution is the first general-purpose, high-performance, quality-preserving sparse inference framework specifically designed for DiTs.
📝 Abstract
Multi-Modal Diffusion Transformers (DiTs) demonstrate exceptional capabilities in visual synthesis, yet their deployment remains constrained by substantial computational demands. To alleviate this bottleneck, many sparsity-based acceleration methods have been proposed. However, their diverse sparsity patterns often require customized kernels for high-performance inference, limiting universality. We propose FlashOmni, a unified sparse attention engine compatible with arbitrary DiT architectures. FlashOmni introduces flexible sparse symbols to standardize the representation of a wide range of sparsity strategies, such as feature caching and block-sparse skipping. This unified abstraction enables the execution of diverse sparse computations within a single attention kernel. In addition, FlashOmni designs optimized sparse GEMMs for attention blocks, leveraging sparse symbols to eliminate redundant computations and further improve efficiency. Experiments demonstrate that FlashOmni delivers near-linear speedup that closely matches the sparsity ratio (1:1) in attention and GEMM-$Q$, and achieves 2.5$\times$-3.8$\times$ acceleration in GEMM-$O$ (peaking at about 87.5% of the theoretical limit). Applied with a multi-granularity sparsity strategy, it enables the Hunyuan model (33K) to achieve about 1.5$\times$ end-to-end acceleration without degrading visual quality.
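To make the "sparse symbol" abstraction concrete, the sketch below shows one way a per-block boolean mask can drive a single generic attention routine, skipping masked (query-block, key-block) tiles. This is a minimal NumPy illustration under assumed semantics; the function name, mask layout, and block size are hypothetical and not FlashOmni's actual kernel or API.

```python
# Illustrative sketch (NOT FlashOmni's real kernel): sparse symbols are
# modeled as a (num_q_blocks, num_k_blocks) boolean mask; one generic
# attention routine consults it to decide which tiles contribute.
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block=4):
    """Q, K, V: (seq, dim) arrays; block_mask: (seq//block, seq//block) bools.

    Each True entry marks a (query-block, key-block) tile to compute;
    False tiles are excluded before the softmax, so a real kernel could
    skip their GEMM work entirely.
    """
    seq, dim = Q.shape
    scores = Q @ K.T / np.sqrt(dim)
    # Expand block-level sparse symbols into a token-level mask.
    token_mask = np.kron(block_mask, np.ones((block, block), dtype=bool))
    scores = np.where(token_mask, scores, -np.inf)
    # Row-wise softmax over keys (assumes each query row keeps >= 1 block).
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With an all-True mask this reduces to dense attention, while zeroing off-diagonal blocks yields a local-attention pattern — the point being that one routine serves both, selected purely by the mask.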