🤖 AI Summary
To address the throughput and energy-efficiency bottlenecks of dynamic sparse attention in long-sequence LLM inference under large-token parallel processing (LTPP), this work departs from conventional stage-isolated optimization and proposes the first cross-stage co-optimized compute-memory framework. Key innovations include: (1) logarithmic-domain addition with leading-zero-based sparse prediction; (2) a distributed sorting-guided FlashAttention update; and (3) coordinated tiling for fine-grained inter-stage interaction. Through algorithm-hardware co-design, integrating a customized STAR accelerator and a multi-core spatial architecture, the work achieves 9.2× higher throughput and 71.2× better energy efficiency than dense attention on an NVIDIA A100. Against state-of-the-art accelerators, the design improves energy efficiency and area efficiency by 16.1× and 27.1×, respectively, and the Spatial-STAR architecture delivers a 20.1× throughput gain over the baseline design.
📝 Abstract
Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-token parallel processing (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted-updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to 9.2× speedup and 71.2× higher energy efficiency over the A100, and surpassing SOTA accelerators by up to 16.1× in energy efficiency and 27.1× in area efficiency. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long-sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1× throughput improvement.
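To make the prediction step concrete, here is a minimal toy sketch of the general idea behind log-domain, add-only sparsity scoring: a leading-zero counter gives a free approximation of log2 magnitude, so an expensive multiply in a Q·K score can be replaced by an integer addition of approximate exponents, and keys whose approximate score falls below a threshold are pruned before full attention runs. All function names and the threshold scheme here are illustrative assumptions, not the paper's exact algorithm.

```python
def approx_log2(x: int) -> int:
    # MSB position = bit_length - 1; in hardware this comes from a
    # leading-zero counter rather than a real logarithm unit.
    return x.bit_length() - 1 if x > 0 else 0

def approx_score(q: list[int], k: list[int]) -> int:
    # Log-domain "multiply": add approximate exponents instead of
    # multiplying values, then accumulate as a cheap proxy for the
    # magnitude of the true dot product.
    return sum(approx_log2(abs(a)) + approx_log2(abs(b))
               for a, b in zip(q, k) if a != 0 and b != 0)

def predict_keep(q: list[int], keys: list[list[int]],
                 threshold: int) -> list[int]:
    # Keep only keys whose approximate score clears the threshold;
    # the exact attention computation runs only on these survivors.
    return [i for i, k in enumerate(keys) if approx_score(q, k) >= threshold]

q = [3, 5, 2]
keys = [[4, 4, 4], [1, 0, 1], [8, 8, 8]]
print(predict_keep(q, keys, threshold=5))  # → [0, 2]
```

Because every operation is an integer add or a bit-count, the predictor's cost per key is far below that of the exact dot product it gates, which is what makes running it over all keys at LTPP scale affordable.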