FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal diffusion Transformers (DiTs) achieve state-of-the-art visual synthesis quality but suffer from prohibitive computational overhead, hindering practical deployment; existing sparsification methods rely on architecture-specific kernels, limiting generality. This work proposes a Unified Sparse Attention Engine that abstracts diverse sparsity patterns via a flexible symbolic representation, enabling a single kernel to support efficient inference across arbitrary DiT architectures. Integrated optimizations—including feature caching, block-sparse skip connections, and optimized sparse GEMM—significantly reduce redundant computation. Experiments demonstrate near-linear speedup for attention and GEMM-Q (≈1:1), 2.5×–3.8× acceleration for GEMM-O (reaching up to 87.5% of the theoretical peak), and 1.5× end-to-end inference speedup—all without compromising generation fidelity. The core contribution is the first general-purpose, high-performance, quality-preserving sparse inference framework specifically designed for DiTs.

📝 Abstract
Multi-Modal Diffusion Transformers (DiTs) demonstrate exceptional capabilities in visual synthesis, yet their deployment remains constrained by substantial computational demands. To alleviate this bottleneck, many sparsity-based acceleration methods have been proposed. However, their diverse sparsity patterns often require customized kernels for high-performance inference, limiting universality. We propose FlashOmni, a unified sparse attention engine compatible with arbitrary DiT architectures. FlashOmni introduces flexible sparse symbols to standardize the representation of a wide range of sparsity strategies, such as feature caching and block-sparse skipping. This unified abstraction enables the execution of diverse sparse computations within a single attention kernel. In addition, FlashOmni designs optimized sparse GEMMs for attention blocks, leveraging sparse symbols to eliminate redundant computations and further improve efficiency. Experiments demonstrate that FlashOmni delivers near-linear speedup that closely matches the sparsity ratio (1:1) in attention and GEMM-$Q$, and achieves 2.5$\times$-3.8$\times$ acceleration in GEMM-$O$ (peaking at about 87.5% of the theoretical limit). Applied with a multi-granularity sparsity strategy, it enables the Hunyuan model (33K) to achieve about 1.5$\times$ end-to-end acceleration without degrading visual quality.
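The core idea of block-sparse skipping can be illustrated with a toy sketch: a boolean block mask (standing in for the paper's sparse symbols) marks which query/key block pairs participate, and skipped blocks contribute no computation. This is a minimal NumPy illustration under assumed names (`block_sparse_attention`, `block_mask`), not the FlashOmni kernel, which fuses these decisions into a single high-performance attention kernel.

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block=4):
    """Toy block-sparse attention.

    block_mask[i, j] == False means query-block i never attends
    key-block j, so that score tile is skipped entirely (treated
    as -inf before the softmax).
    """
    n, d = Q.shape
    nb = n // block
    out = np.zeros_like(V)
    for i in range(nb):
        qi = Q[i * block:(i + 1) * block]            # (block, d)
        scores = np.full((block, n), -np.inf)
        for j in range(nb):
            if not block_mask[i, j]:
                continue                             # sparse skip: no GEMM for this tile
            kj = K[j * block:(j + 1) * block]
            scores[:, j * block:(j + 1) * block] = qi @ kj.T / np.sqrt(d)
        # softmax over the kept positions only
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        out[i * block:(i + 1) * block] = w @ V
    return out
```

With a fully dense mask this reduces to ordinary attention; as the fraction of `False` blocks grows, the work shrinks roughly in proportion, which is the near-linear attention speedup the abstract reports.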
Problem

Research questions and friction points this paper is trying to address.

Accelerating multi-modal diffusion transformers with computational bottlenecks
Unifying diverse sparsity patterns for universal DiT architecture compatibility
Optimizing sparse attention computations while maintaining visual synthesis quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified sparse attention engine for diverse DiT architectures
Flexible sparse symbols standardize various sparsity strategies
Optimized sparse GEMMs eliminate redundant computations efficiently
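The third bullet can be sketched in the same spirit: a per-block symbol vector tells the GEMM which row-blocks are active, and inactive blocks are never multiplied. This is a hedged toy model (the names `sparse_gemm_rows` and `row_symbols` are illustrative assumptions), not the paper's optimized CUDA kernels.

```python
import numpy as np

def sparse_gemm_rows(A, B, row_symbols, block=4):
    """Toy sparse GEMM: row-blocks of A whose symbol is falsy are
    skipped, leaving zeros in the corresponding output rows, so the
    cost scales with the number of active blocks."""
    m, k = A.shape
    out = np.zeros((m, B.shape[1]), dtype=A.dtype)
    for i, active in enumerate(row_symbols):
        if not active:
            continue  # redundant computation eliminated
        rows = slice(i * block, (i + 1) * block)
        out[rows] = A[rows] @ B
    return out
```

In FlashOmni's setting, reusing the same sparse symbols for both the attention kernel and the surrounding GEMMs is what lets one representation drive all of the skipping decisions.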
Liang Qiao
University of Science and Technology of China
Yue Dai
University of Science and Technology of China
Yeqi Huang
University of Edinburgh
ServerlessAI
Hongyu Kan
University of Virginia
Jun Shi
University of Science and Technology of China
Hong An
University of Science and Technology of China