Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address throughput and energy-efficiency bottlenecks of dynamic sparse attention in long-sequence LLM inference under large-token parallel processing (LTPP), this work departs from conventional stage-isolated optimization, proposing the first cross-stage co-optimized compute-memory framework. Key innovations include: (1) logarithmic-domain addition with leading-zero-based sparse prediction; (2) distributed sorting-guided FlashAttention update; and (3) coordinated tiling for fine-grained inter-stage interaction. Through algorithm-hardware co-design, integrating a customized STAR accelerator and a multi-core spatial architecture, the design achieves 9.2× higher throughput and 71.2× better energy efficiency than an A100 GPU running dense attention. Against state-of-the-art accelerators, it improves energy efficiency and area efficiency by 16.1× and 27.1×, respectively, and the multi-core Spatial-STAR architecture delivers a 20.1× throughput gain over the baseline design.

📝 Abstract
Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces a leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted-updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to 9.2× speedup and 71.2× better energy efficiency than an A100 GPU, and surpassing SOTA accelerators by up to 16.1× and 27.1× in energy and area efficiency, respectively. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1× throughput improvement.
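The abstract's "leading-zero-based sparsity prediction using log-domain add-only operations" can be illustrated roughly as follows. This is a minimal sketch, not the paper's scheme: the helper names (`ilog2`, `predict_scores_logdomain`) are hypothetical, and a max over the reduction dimension is used here as a crude stand-in for accumulating the dominant product term; the actual accelerator operates on hardware leading-zero detectors.

```python
import numpy as np

def ilog2(x):
    """Integer log2 estimate (bit_length - 1), the software analogue of a
    hardware leading-zero count. Hypothetical helper, not the paper's RTL."""
    x = np.maximum(np.abs(x).astype(np.int64), 1)
    return np.floor(np.log2(x)).astype(np.int64)

def predict_scores_logdomain(Q, K):
    """Add-only surrogate for |Q @ K^T|: each per-element product becomes a
    log-domain addition, and a max over the reduction dimension stands in
    for the dominant term of the accumulation (cheap add/compare only)."""
    lq, lk = ilog2(Q), ilog2(K)           # (nq, d), (nk, d)
    return (lq[:, None, :] + lk[None, :, :]).max(axis=-1)  # (nq, nk)

# Toy usage: cheap predicted scores select the top-k keys per query;
# only those keys would then get exact attention downstream.
rng = np.random.default_rng(0)
Q = rng.integers(-128, 128, size=(4, 16))
K = rng.integers(-128, 128, size=(8, 16))
approx = predict_scores_logdomain(Q, K)
topk = np.argsort(-approx, axis=1)[:, :4]  # key indices to compute exactly
```

The point of the sketch is the cost model: prediction uses only absolute values, leading-zero counts, additions, and comparisons, so no multipliers are needed before the exact sparse attention pass.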
Problem

Research questions and friction points this paper is trying to address.

Optimizes sparse attention computation for large language models
Reduces redundant operations in high-throughput inference scenarios
Improves memory efficiency for long sequence processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-stage coordination reduces computation and memory access
Leading-zero-based sparsity prediction minimizes prediction overhead
Coordinated tiling strategy enables fine-grained stage interaction
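The sorted-updating FlashAttention idea, an online softmax evaluated over only a predictor-selected key set, can be sketched for a single query as below. Names, the blocking parameter, and the given `keep_idx` are assumptions for illustration; the paper's distributed sorting hardware and tiling dataflow are not modeled.

```python
import numpy as np

def sparse_online_softmax_attention(q, K, V, keep_idx, block=4):
    """FlashAttention-style online softmax restricted to kept keys,
    processed block by block with running max/denominator rescaling."""
    m, l = -np.inf, 0.0                  # running score max, softmax denominator
    acc = np.zeros(V.shape[1], dtype=np.float64)
    idx = np.sort(np.asarray(keep_idx))  # visit kept keys in sorted order
    for start in range(0, len(idx), block):
        blk = idx[start:start + block]
        s = K[blk] @ q                   # exact scores for this key block
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new)        # rescale previous partial results
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[blk]
        m = m_new
    return acc / l
```

Because the running max and denominator are updated incrementally, the result matches a dense softmax over the kept keys while each key/value block is touched exactly once, which is what makes fine-grained interleaving with the prediction and sorting stages possible.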
Huizheng Wang
Tsinghua University
Sparse Attention, LLM accelerator, AI Infra, Distributed Parallelism, VLSI
Taiquan Wei
School of Integrated Circuits, Tsinghua University, Beijing, 100084, China
Hongbin Wang
School of Integrated Circuits, Tsinghua University, Beijing, 100084, China
Zichuan Wang
School of Integrated Circuits, Tsinghua University, Beijing, 100084, China
Xinru Tang
University of California, Irvine
HCI, CSCW, Accessibility
Zhiheng Yue
School of Integrated Circuits, Tsinghua University, Beijing, 100084, China
Shaojun Wei
Professor, Tsinghua University
Yang Hu
School of Integrated Circuits, Tsinghua University, Beijing, 100084, China
Shouyi Yin
Tsinghua University