🤖 AI Summary
Mainstream Transformer models suffer from computational cost that scales quadratically with sequence length and inference memory that grows linearly, and training them stably and efficiently on non-NVIDIA hardware remains difficult. To address these limitations, we propose SpikingBrain, a family of brain-inspired spiking large language models that integrates adaptive spiking neurons, linear and hybrid-linear attention, and an event-driven sparse activation mechanism (69.15% sparsity), enabling near-constant-memory inference and efficient ultra-long-sequence processing. Leveraging a conversion-based training pipeline, a dedicated spike coding framework, and system-level optimizations tailored to the MetaX GPU cluster, we successfully train 7B- and 76B-parameter models. On 4M-token sequences, SpikingBrain achieves over 100× speedup in Time to First Token; the 7B model attains 23.4% Model FLOPs Utilization and matches open-source Transformer baselines after only ~150B tokens of continued pre-training.
📝 Abstract
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware.
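The constant-memory property of the linear-attention component in (1) can be illustrated with a minimal sketch: instead of caching every past key/value pair (a KV cache that grows with sequence length), each step folds the new pair into a fixed-size state matrix that queries then read from. This is an illustrative unnormalized recurrence under simplifying assumptions, not the paper's exact formulation, and all names here are hypothetical.

```python
import numpy as np

def linear_attention_step(state, k, v, q):
    """One recurrent step of (unnormalized) linear attention.

    Rather than attending over all past tokens, each (k, v) pair is
    folded into a fixed d x d state, so per-step memory does not grow
    with sequence length.
    """
    state = state + np.outer(k, v)  # accumulate key-value outer products
    out = q @ state                 # query reads from the fixed-size state
    return state, out

d = 8                               # head dimension (illustrative)
rng = np.random.default_rng(0)
state = np.zeros((d, d))
for _ in range(1000):               # sequence length affects time, not memory
    k, v, q = rng.standard_normal((3, d))
    state, out = linear_attention_step(state, k, v, q)

print(state.shape, out.shape)       # state stays (8, 8) regardless of length
```

Because the state size is fixed, decoding a 4M-token sequence needs the same attention memory as a 4K-token one, which is what enables the near-constant-memory inference described above.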
Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models significantly improve long-sequence training efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100× speedup in Time to First Token for 4M-token sequences. Training remains stable for weeks on hundreds of MetaX C550 GPUs, with the 7B model reaching a Model FLOPs Utilization of 23.4%. The proposed spiking scheme achieves 69.15% sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
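The reported activation sparsity arises from event-driven spike coding: activations that do not cross a firing threshold are silenced to exact zeros, so downstream computation can skip most entries. The sketch below illustrates the general idea with a simple fixed-threshold ternary coder; the paper's actual scheme uses adaptive spiking neurons, so the threshold, function names, and measured sparsity here are purely illustrative.

```python
import numpy as np

def spike_encode(x, threshold=1.0):
    """Threshold activations into ternary spike events (-1, 0, +1).

    Only values whose magnitude reaches the threshold emit a spike;
    everything else stays silent (zero), giving event-driven sparsity.
    This is a fixed-threshold stand-in, not the adaptive neurons used
    in SpikingBrain.
    """
    return (np.abs(x) >= threshold).astype(x.dtype) * np.sign(x)

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)     # stand-in for a layer's activations
s = spike_encode(x, threshold=1.0)

sparsity = 1.0 - np.count_nonzero(s) / s.size
print(f"sparsity: {sparsity:.2%}")
```

High sparsity means most multiply-accumulates can be skipped entirely on event-driven hardware, which is the basis for the low-power claim.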