🤖 AI Summary
To address the high energy cost of deploying large language models (LLMs) on edge devices, this work adapts a matrix-multiplication-free (MatMul-free) LLM architecture to Intel's Loihi 2 neuromorphic processor. The approach combines hardware-aware quantization, event-driven sparse computation, and stateful neural dynamics with a Loihi 2-specific compiler mapping, and it shows that a 370M-parameter MatMul-free model can be quantized with no loss of accuracy. Preliminary results indicate up to 3× higher throughput at roughly half the energy of a transformer baseline on an edge GPU, with markedly better scaling as sequence length grows. Taken together, the results are an early empirical demonstration that spiking neural network (SNN)-style, event-driven hardware can support efficient, scalable LLM inference at the edge.
📝 Abstract
Large language models (LLMs) deliver impressive performance but consume large amounts of energy. In this work, we present a MatMul-free LLM architecture adapted for Intel's neuromorphic processor, Loihi 2. Our approach leverages Loihi 2's support for low-precision, event-driven computation and stateful processing. Running our hardware-aware quantized model on GPU, we demonstrate that a 370M-parameter MatMul-free model can be quantized with no accuracy loss. Based on preliminary results, we report up to 3× higher throughput at half the energy compared to transformer-based LLMs on an edge GPU, with significantly better scaling. Further hardware optimizations will increase throughput and decrease energy consumption further. These results show the potential of neuromorphic hardware for efficient inference and pave the way for efficient reasoning models capable of generating complex, long-form text rapidly and cost-effectively.
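The abstract does not spell out what makes a layer "MatMul-free." As background, here is a minimal sketch of the ternary-weight trick common in MatMul-free LM research: when weights are quantized to {-1, 0, +1}, a matrix-vector product collapses into signed accumulation, with no multiplications. This is an illustrative assumption about the general technique, not this paper's published implementation; the helper names `ternarize` and `ternary_matvec` are ours.

```python
import numpy as np

def ternarize(W, threshold=0.5):
    """Quantize a real-valued weight matrix to {-1, 0, +1}.

    Hypothetical helper: the paper's quantizer is not reproduced here;
    this is the simple absolute-threshold scheme used in much
    MatMul-free / BitNet-style work.
    """
    Wt = np.zeros_like(W, dtype=np.int8)
    Wt[W > threshold] = 1
    Wt[W < -threshold] = -1
    return Wt

def ternary_matvec(Wt, x):
    """Compute Wt @ x using only additions and subtractions.

    Because every weight is -1, 0, or +1, each output element is a
    signed sum of selected inputs -- no multiplications are needed,
    which is what makes the layer 'MatMul-free'.
    """
    out = np.empty(Wt.shape[0], dtype=x.dtype)
    for i, row in enumerate(Wt):
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out
```

This accumulate-only structure is also why the architecture is a plausible fit for Loihi 2: event-driven neuromorphic hardware is built around sparse synaptic accumulation rather than dense multiply-accumulate, so zero weights and inactive inputs cost nothing.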