🤖 AI Summary
State Space Models (SSMs) suffer from severe memory-bandwidth and on-chip storage bottlenecks during long-sequence prefilling, limiting hardware acceleration efficiency. This paper proposes an adaptive, memory-aware, fine-grained operator fusion methodology that jointly optimizes scheduling, dataflow restructuring, and fusion-aware hardware co-design within an extended Stream framework to systematically explore the SSM accelerator design space. The results establish operator fusion as a key enabling technique for next-generation SSM accelerators, reducing on-chip memory requirements by an order of magnitude. Compared to non-fused execution, the approach achieves up to 4.8× end-to-end speedup, and under identical area constraints it delivers 1.78× higher performance than the state-of-the-art MARCA accelerator.
📝 Abstract
State Space Models (SSMs) offer a promising alternative to transformers for long-sequence processing. However, their efficiency remains hindered by memory-bound operations, particularly in the prefill stage. While MARCA, a recent first effort to accelerate SSMs through a dedicated hardware accelerator, achieves significant speedups over high-end GPUs, an analysis of the broader accelerator design space is still lacking. This work systematically analyzes SSM acceleration opportunities from both the scheduling perspective, through fine-grained operator fusion, and the hardware perspective, through design space exploration, using an extended version of the Stream modeling framework. Our results demonstrate that the improved data locality stemming from our optimized fusion and scheduling strategy enables a speedup of up to 4.8x over unfused execution, while our adaptive memory-aware fusion approach reduces on-chip memory requirements by an order of magnitude without sacrificing performance. We further explore accelerator design trade-offs, showing that a fusion-aware hardware architecture can achieve 1.78x higher performance than the state-of-the-art MARCA accelerator within the same area budget. These results establish operator fusion as a key enabler for next-generation SSM accelerators.
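To make the memory argument concrete, here is a toy NumPy sketch (not the paper's implementation; all sizes and operators are hypothetical) of why fusing a chain of elementwise SSM operators shrinks intermediate storage: an unfused schedule materializes full sequence-length intermediates between operators, while a tile-wise fused schedule only ever holds a tile-sized buffer on chip.

```python
# Toy illustration of operator fusion for a chain of elementwise ops,
# such as those found in an SSM block. Sizes and the op chain
# (scale -> exp -> gate) are hypothetical, chosen only for illustration.
import numpy as np

L, D = 1 << 14, 64        # hypothetical sequence length and channel dim
TILE = 256                # hypothetical on-chip tile size

x = np.random.rand(L, D).astype(np.float32)
w = np.random.rand(D).astype(np.float32)

def unfused(x, w):
    # Each operator materializes a full (L, D) intermediate,
    # which for long sequences must spill to off-chip memory.
    a = x * w            # elementwise scale
    b = np.exp(a)        # activation
    return b * x         # gating

def fused(x, w):
    # Operators fused per tile: live intermediates are only (TILE, D),
    # independent of sequence length L.
    out = np.empty_like(x)
    for i in range(0, x.shape[0], TILE):
        t = x[i:i + TILE]
        out[i:i + TILE] = np.exp(t * w) * t
    return out

assert np.allclose(unfused(x, w), fused(x, w))
```

In this sketch the fused variant's peak intermediate footprint is L/TILE (here 64×) smaller, which mirrors the order-of-magnitude on-chip memory reduction the fusion strategy targets; the real design space additionally covers non-elementwise operators, scheduling, and hardware co-design.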