FireFly-T: High-Throughput Sparsity Exploitation for Spiking Transformer Acceleration with Dual-Engine Overlay Architecture

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing hardware accelerators struggle to efficiently support fine-grained activation sparsity and spike-based attention computation in Spiking Transformers. To address this, we propose a dual-engine co-designed architecture: a sparse engine for dynamic activation sparsity and a binary engine optimized for AND-PopCount operations in spike attention. Key contributions include a high-throughput sparse decoder, a conflict-free load-balancing mechanism, an SRAM byte-write-enabled 3D dataflow for spike attention, and LUT6-level logic synthesis optimization. Implemented on Xilinx FPGAs, the design supports dynamic dataflow orchestration, multi-dimensional parallelism, and out-of-order scheduling. Compared to FireFly v2 and SpikeTA, our accelerator achieves 1.39× and 2.40× higher energy efficiency, respectively, and 4.21× and 7.10× greater DSP efficiency. These advances significantly enhance hardware scalability and deployment efficiency for Spiking Transformers.
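
To make the spike-attention datapath concrete, the following is a minimal software sketch of the AND-PopCount idea summarized above (plain Python with an assumed bit-packed layout, not the paper's hardware datapath): query and key spike vectors are packed as bit masks, and each attention score is the population count of their bitwise AND, replacing multiply-accumulate entirely.

```python
# Minimal sketch (assumed bit-packed layout, not the paper's datapath):
# spiking attention scores computed with AND + PopCount instead of MACs.
def and_popcount_attention(q_rows, k_rows):
    """q_rows, k_rows: binary spike vectors packed into ints, one per token.
    Returns integer scores S[i][j] = popcount(Q_i AND K_j)."""
    return [[bin(q & k).count("1") for k in k_rows] for q in q_rows]

# Example: 4-bit spike vectors, 2 query tokens and 3 key tokens.
Q = [0b1011, 0b0110]
K = [0b1001, 0b0111, 0b1100]
print(and_popcount_attention(Q, K))  # [[2, 2, 1], [0, 2, 1]]
```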

📝 Abstract
Spiking transformers are emerging as a promising architecture that combines the energy efficiency of Spiking Neural Networks (SNNs) with the powerful attention mechanisms of transformers. However, existing hardware accelerators lack support for spiking attention, exhibit limited throughput in exploiting fine-grained sparsity, and struggle with scalable parallelism in sparse computation. To address these, we propose FireFly-T, a dual-engine overlay architecture that integrates a sparse engine for activation sparsity and a binary engine for spiking attention. In the sparse engine, we propose a high-throughput sparse decoder that exploits fine-grained sparsity by concurrently extracting multiple non-zero spikes. To complement this, we introduce a scalable load-balancing mechanism with weight dispatch and out-of-order execution, eliminating bank conflicts to support scalable multi-dimensional parallelism. In the binary engine, we leverage the byte-level write capability of SRAMs to efficiently manipulate the 3D dataflows required for spiking attention with minimal resource overhead. We also optimize the core AND-PopCount operation in spiking attention through a LUT6-based implementation, improving timing closure and reducing LUT utilization on Xilinx FPGAs. As an overlay architecture, FireFly-T further incorporates an orchestrator that dynamically manipulates input dataflows with flexible adaptation for diverse network topologies, while ensuring efficient resource utilization and maintaining high throughput. Experimental results demonstrate that our accelerator achieves $1.39\times$ and $2.40\times$ higher energy efficiency, as well as $4.21\times$ and $7.10\times$ greater DSP efficiency, compared to FireFly v2 and the transformer-enabled SpikeTA, respectively. These results highlight its potential as an efficient hardware platform for spiking transformers.
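
As a rough software analogue of the sparse decoder described in the abstract (an illustration under assumed parameters such as the lane count, not the hardware pipeline), the decoder's job can be pictured as scanning a bit-packed spike word and emitting several non-zero spike positions per step, so that only the weights at those positions need to be fetched and accumulated:

```python
# Minimal sketch (software analogue of multi-spike extraction; `lanes` is a
# hypothetical parallelism parameter, not a value from the paper).
def decode_nonzero_spikes(spike_word, width, lanes):
    """Yield groups of at most `lanes` set-bit positions per step."""
    positions = [i for i in range(width) if (spike_word >> i) & 1]
    for start in range(0, len(positions), lanes):
        yield positions[start:start + lanes]

# Example: a 16-bit spike word with 5 active spikes, decoded 4 lanes at a time.
for group in decode_nonzero_spikes(0b0010_0110_0001_1000, width=16, lanes=4):
    print(group)  # [3, 4, 9, 10] then [13]
```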
Problem

Research questions and friction points this paper is trying to address.

Lack of hardware support for spiking attention mechanisms
Limited throughput in exploiting fine-grained sparsity
Challenges in scalable parallelism for sparse computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-engine overlay for sparsity and attention
High-throughput sparse decoder for non-zero spikes
LUT6-based AND-PopCount for timing optimization
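
One plausible reading of the "LUT6-based AND-PopCount" item above (an illustrative sketch, not the paper's actual netlist): a Xilinx LUT6 implements any Boolean function of six inputs, so a single LUT6 can fuse the AND of three (query, key) spike-bit pairs with one output bit of their 2-bit population count, and a pair of LUT6s emits the partial sum that downstream adder logic accumulates across wider spike vectors.

```python
# Hedged sketch of the LUT6 fusion idea (hypothetical helper, not the RTL):
# six inputs (three q bits, three k bits) -> one bit of popcount(q AND k).
def lut6_and_popcount_bit(q2, q1, q0, k2, k1, k0, bit):
    """Return bit `bit` (0 = LSB, 1 = MSB) of popcount over three AND pairs."""
    s = (q2 & k2) + (q1 & k1) + (q0 & k0)  # value in 0..3, fits in 2 bits
    return (s >> bit) & 1

# Two such LUT6s (bit=0 and bit=1) yield a 2-bit partial sum per 3 spike pairs.
print(lut6_and_popcount_bit(1, 1, 0, 1, 1, 1, bit=0),
      lut6_and_popcount_bit(1, 1, 0, 1, 1, 1, bit=1))  # 0 1 -> popcount = 2
```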
Tenglong Li
Institute of Automation, Chinese Academy of Sciences
Hardware Architecture
Jindong Li
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Guobin Shen
School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China; and the Center for Long-term AI, Beijing 101407, China
Dongcheng Zhao
Beijing Institute of AI Safety and Governance
Spiking Neural Networks · Event Based Vision · Brain-inspired AI · LLM Safety
Qian Zhang
School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; and the Center for Long-term AI, Beijing 101407, China
Yi Zeng
Center for Long-term AI, Beijing 101407, China; and the State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences, Shanghai 200031, China