🤖 AI Summary
To address the throughput and energy-efficiency bottlenecks of dynamic sparse attention in long-sequence LLM inference under large-token parallel processing (LTPP), this work departs from conventional stage-isolated optimization and proposes the first cross-stage co-optimized compute-memory framework. Key innovations include: (1) logarithmic-domain addition with leading-zero-based sparse prediction; (2) a distributed sorting-guided FlashAttention update; and (3) coordinated tiling for fine-grained inter-stage interaction. Through algorithm-hardware co-design, integrating a customized STAR accelerator and a multi-core spatial architecture, the work achieves 9.2× higher throughput and 71.2× better energy efficiency than dense attention on an NVIDIA A100. Against state-of-the-art accelerators, the design improves energy efficiency and area efficiency by 16.1× and 27.1×, respectively, and the Spatial-STAR architecture delivers a 20.1× throughput gain over the baseline design.
📝 Abstract
Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-token parallel processing (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted-updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to 9.2× speedup and 71.2× higher energy efficiency over the A100, and surpassing SOTA accelerators by up to 16.1× in energy efficiency and 27.1× in area efficiency. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long-sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a 20.1× throughput improvement.
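To make the prediction step concrete, here is a minimal toy sketch of the general idea behind log-domain, add-only sparsity scoring: a leading-zero counter gives a free approximation of log2 magnitude, so an expensive multiply in a Q·K score can be replaced by an integer addition of approximate exponents, and keys whose approximate score falls below a threshold are pruned before full attention runs. All function names and the threshold scheme here are illustrative assumptions, not the paper's exact algorithm.

```python
def approx_log2(x: int) -> int:
    # MSB position = bit_length - 1; in hardware this comes from a
    # leading-zero counter rather than a real logarithm unit.
    return x.bit_length() - 1 if x > 0 else 0

def approx_score(q: list[int], k: list[int]) -> int:
    # Log-domain "multiply": add approximate exponents instead of
    # multiplying values, then accumulate as a cheap proxy for the
    # magnitude of the true dot product.
    return sum(approx_log2(abs(a)) + approx_log2(abs(b))
               for a, b in zip(q, k) if a != 0 and b != 0)

def predict_keep(q: list[int], keys: list[list[int]],
                 threshold: int) -> list[int]:
    # Keep only keys whose approximate score clears the threshold;
    # the exact attention computation runs only on these survivors.
    return [i for i, k in enumerate(keys) if approx_score(q, k) >= threshold]

q = [3, 5, 2]
keys = [[4, 4, 4], [1, 0, 1], [8, 8, 8]]
print(predict_keep(q, keys, threshold=5))  # → [0, 2]
```

Because every operation is an integer add or a bit-count, the predictor's cost per key is far below that of the exact dot product it gates, which is what makes running it over all keys at LTPP scale affordable.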