AI Summary
To address bandwidth bottlenecks and energy inefficiency in inter-chip communication for AI accelerators, this work proposes SNAP, a hybrid architecture that deploys end-to-end trainable Spiking Neural Networks (SNNs) with learnable sparsity exclusively at die-to-die interfaces for low-overhead data encoding, while retaining high-precision Artificial Neural Networks (ANNs) on-die for core computation. It introduces the first algorithm-architecture co-design framework that strictly confines SNNs to communication partitions, enables joint optimization of structured sparsity, and supports composable heterogeneous integration of SNN and ANN pathways. Evaluated on language and vision tasks, SNAP achieves up to 5.3x higher energy efficiency and 15.2x lower inference latency than baseline ANN-only and SNN-only approaches; these gains grow with model size, overcoming the two limitations of conventional SNNs: poor scalability and subpar task performance.
Abstract
Efficient communication is central to both biological and artificial intelligence (AI) systems. In biological brains, the challenge of long-range communication across regions is addressed through sparse, spike-based signaling, minimizing energy consumption and latency. In contrast, modern AI workloads, which keep scaling ever larger across distributed compute systems, are increasingly constrained by bandwidth limitations, creating bottlenecks that hinder scalability and energy efficiency. Inspired by the brain's efficient communication strategies, we propose SNAP, a hybrid neural network architecture combining spiking neural networks (SNNs) and artificial neural networks (ANNs) to address these challenges. SNAP integrates SNNs at bandwidth-constrained regions, such as chip boundaries, where spike-based encoding reduces data transfer overhead. Within each chip, dense ANN computations are maintained to preserve high throughput, accuracy, and robustness. Historically, SNNs have faced difficulties scaling up, with limitations in task-specific performance and reliance on specialized hardware to exploit sparsity. SNAP overcomes these barriers through an algorithm-architecture co-design leveraging learnable sparsity for die-to-die communication while limiting spiking layers to specific network partitions. This composable design integrates spike-based and non-spiking pathways, making it adaptable to diverse deep learning workloads. Our evaluations on language processing and computer vision tasks demonstrate up to 5.3x energy efficiency improvements and 15.2x reductions in inference latency, outperforming both traditional SNNs and non-spiking models. We find that as model resources scale, SNAP's improvement margins grow. By addressing the critical bottleneck of inter-chip communication, SNAP offers a scalable, biologically inspired pathway to more efficient AI systems.
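The idea of spiking only at the chip boundary can be illustrated with a small sketch. The code below is not the SNAP implementation: it assumes a simple integrate-and-fire encoder with a fixed threshold (SNAP learns its sparsity during training), rate decoding on the receiving chip, and an event-based link that sends one neuron index per spike. All function names, the 16-bit dense baseline, and the layer sizes are hypothetical choices made for illustration.

```python
import random


def dense_layer(x, weights):
    """On-chip dense ANN layer with ReLU (one output per weight row)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def spike_encode(activations, threshold, timesteps):
    """Integrate-and-fire encoding at the sending chip's boundary.

    Each neuron adds its activation to a membrane potential every timestep
    and emits a spike (1) when the potential reaches the threshold.
    The threshold is fixed here; treating it as learnable is left out.
    """
    potentials = [0.0] * len(activations)
    spike_train = []
    for _ in range(timesteps):
        step = []
        for i, a in enumerate(activations):
            potentials[i] += a
            if potentials[i] >= threshold:
                potentials[i] -= threshold  # soft reset keeps the remainder
                step.append(1)
            else:
                step.append(0)
        spike_train.append(step)
    return spike_train


def spike_decode(spike_train, threshold):
    """Receiving chip: estimate activations from spike rates."""
    timesteps = len(spike_train)
    n = len(spike_train[0])
    return [threshold * sum(step[i] for step in spike_train) / timesteps
            for i in range(n)]


def link_bits_dense(n, bits_per_value=16):
    """Bits needed to send n dense activations (assumed 16-bit values)."""
    return n * bits_per_value


def link_bits_sparse(spike_train, index_bits):
    """Bits needed when each spike is sent as one neuron index."""
    return sum(sum(step) for step in spike_train) * index_bits


if __name__ == "__main__":
    rng = random.Random(0)
    n_in, n_out, timesteps = 64, 32, 4
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    x = [rng.random() for _ in range(n_in)]

    acts = dense_layer(x, weights)           # chip A: dense compute
    train = spike_encode(acts, 1.0, timesteps)  # boundary: spikes only
    recovered = spike_decode(train, 1.0)     # chip B: back to dense values

    index_bits = max(1, (n_out - 1).bit_length())
    print("dense link bits: ", link_bits_dense(n_out))
    print("sparse link bits:", link_bits_sparse(train, index_bits))
    print("max abs error:   ", max(abs(a - r) for a, r in zip(acts, recovered)))
```

In this toy setting a higher threshold yields fewer spikes and therefore less link traffic, at the cost of a coarser reconstruction on the receiving chip. That trade-off is the quantity a learnable-sparsity scheme would tune during training instead of fixing by hand.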