🤖 AI Summary
Designing sparse attention for 2D image generation in diffusion models is difficult because spatial locality must be balanced against GPU memory-access efficiency. This paper proposes HilbertA, a hardware-friendly 2D sparse attention mechanism. Its core idea is to reorder image tokens along a Hilbert curve, which preserves spatial neighborhoods while keeping the memory layout contiguous. A sliding schedule across layers supports long-range interactions, and a central shared region supports communication between tiles. All components are implemented in Triton to ensure memory-contiguous access and compute-aligned GPU acceleration. Experiments show that HilbertA achieves 2.3× and 4.17× speedups in attention computation at 1024×1024 and 2048×2048 resolutions, respectively, while maintaining image quality on par with or superior to state-of-the-art sparse attention methods.
📝 Abstract
Designing sparse attention for diffusion transformers requires reconciling two-dimensional spatial locality with GPU efficiency, a balance that current methods struggle to achieve. Existing approaches enforce two-dimensional spatial locality but often incur uncoalesced memory access. We present HilbertA, a 2D-aware and GPU-efficient sparse attention mechanism. HilbertA reorders image tokens along Hilbert curves to achieve a contiguous memory layout while preserving spatial neighborhoods, and employs a sliding schedule across layers to enable long-range information propagation without repeated or uncoalesced memory access. To further enhance cross-tile communication and positional awareness, HilbertA introduces a small central shared region. Implemented in Triton, HilbertA delivers comparable image quality with significant acceleration over prior methods on Flux.1-dev, demonstrating the feasibility of hardware-aligned two-dimensional sparse attention for high-resolution image generation. HilbertA delivers attention speedups of $2.3\times$ when generating $1024 \times 1024$ images, and up to $4.17\times$ at $2048 \times 2048$, while achieving image quality comparable to or surpassing baselines.
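The reordering step can be illustrated with a small sketch. This is not the paper's Triton implementation. It is a plain-Python illustration that assumes a square token grid whose side is a power of two, and the function names `hilbert_index` and `hilbert_permutation` are hypothetical. It uses the standard iterative coordinate-to-index conversion for the Hilbert curve and returns the order in which row-major tokens would be visited.

```python
def hilbert_index(n, x, y):
    """Map grid cell (x, y) to its position along a Hilbert curve.

    Assumes n is a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_permutation(n):
    """Return row-major token indices sorted by Hilbert-curve position.

    Entry k is the row-major index (row * n + col) of the k-th token
    along the curve, so tokens[perm] would give the reordered sequence.
    """
    return sorted(range(n * n), key=lambda i: hilbert_index(n, i % n, i // n))


if __name__ == "__main__":
    perm = hilbert_permutation(4)
    # Print (row, col) for each token in curve order.
    print([(i // 4, i % 4) for i in perm])
```

In this ordering, tokens that are consecutive in the sequence are always adjacent in the 2D grid, so a contiguous chunk of the reordered sequence covers a compact spatial region. That is the property the abstract relies on when it pairs a contiguous memory layout with preserved spatial neighborhoods.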