🤖 AI Summary
Multimodal diffusion Transformers (DiTs) achieve state-of-the-art visual synthesis quality but suffer from prohibitive computational overhead, hindering practical deployment; existing sparsification methods rely on architecture-specific kernels, limiting generality. This work proposes a Unified Sparse Attention Engine that abstracts diverse sparsity patterns via a flexible symbolic representation, enabling a single kernel to support efficient inference across arbitrary DiT architectures. Integrated optimizations—including feature caching, block-sparse skip connections, and optimized sparse GEMM—significantly reduce redundant computation. Experiments demonstrate near-linear speedup for attention and GEMM-Q (≈1:1), 2.5×–3.8× acceleration for GEMM-O (reaching up to 87.5% of the theoretical peak), and 1.5× end-to-end inference speedup—all without compromising generation fidelity. The core contribution is the first general-purpose, high-performance, quality-preserving sparse inference framework specifically designed for DiTs.
📝 Abstract
Multi-Modal Diffusion Transformers (DiTs) demonstrate exceptional capabilities in visual synthesis, yet their deployment remains constrained by substantial computational demands. To alleviate this bottleneck, many sparsity-based acceleration methods have been proposed. However, their diverse sparsity patterns often require customized kernels for high-performance inference, limiting universality. We propose FlashOmni, a unified sparse attention engine compatible with arbitrary DiT architectures. FlashOmni introduces flexible sparse symbols to standardize the representation of a wide range of sparsity strategies, such as feature caching and block-sparse skipping. This unified abstraction enables the execution of diverse sparse computations within a single attention kernel. In addition, FlashOmni designs optimized sparse GEMMs for attention blocks, leveraging sparse symbols to eliminate redundant computations and further improve efficiency. Experiments demonstrate that FlashOmni delivers near-linear speedup that closely matches the sparsity ratio (1:1) in attention and GEMM-$Q$, and achieves 2.5$\times$-3.8$\times$ acceleration in GEMM-$O$ (peaking at about 87.5% of the theoretical limit). Applied with a multi-granularity sparsity strategy, it enables the Hunyuan model (33K) to achieve about 1.5$\times$ end-to-end acceleration without degrading visual quality.
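To make the "sparse symbol" abstraction concrete, the sketch below shows one way a per-block boolean mask can drive a single generic attention routine, skipping masked (query-block, key-block) tiles. This is a minimal NumPy illustration under assumed semantics; the function name, mask layout, and block size are hypothetical and not FlashOmni's actual kernel or API.

```python
# Illustrative sketch (NOT FlashOmni's real kernel): sparse symbols are
# modeled as a (num_q_blocks, num_k_blocks) boolean mask; one generic
# attention routine consults it to decide which tiles contribute.
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block=4):
    """Q, K, V: (seq, dim) arrays; block_mask: (seq//block, seq//block) bools.

    Each True entry marks a (query-block, key-block) tile to compute;
    False tiles are excluded before the softmax, so a real kernel could
    skip their GEMM work entirely.
    """
    seq, dim = Q.shape
    scores = Q @ K.T / np.sqrt(dim)
    # Expand block-level sparse symbols into a token-level mask.
    token_mask = np.kron(block_mask, np.ones((block, block), dtype=bool))
    scores = np.where(token_mask, scores, -np.inf)
    # Row-wise softmax over keys (assumes each query row keeps >= 1 block).
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

With an all-True mask this reduces to dense attention, while zeroing off-diagonal blocks yields a local-attention pattern — the point being that one routine serves both, selected purely by the mask.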