Spatio-Temporal Pruning for Compressed Spiking Large Language Models

📅 2025-08-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of energy constraints on edge devices, high inference latency, and model bloat in large language models (LLMs), this work proposes the first spatio-temporal pruning framework for Spiking Large Language Models (S-LLMs). Methodologically, it introduces a unified compression pipeline integrating spatial pruning (reducing active neurons and attention heads), temporal pruning (dynamically skipping redundant inference steps), 4-bit extreme quantization, and knowledge distillation—jointly optimized to preserve semantic representation capability. Evaluated on SpikingBERT and the GLUE benchmark, the framework achieves a 62.3% reduction in computational cost, a 57.1% decrease in inference latency, a 3.8× improvement in energy efficiency, and less than 1.2% accuracy degradation. This work establishes a scalable, end-to-end optimization paradigm for deploying S-LLMs in low-power, real-time natural language processing applications.
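The summary mentions 4-bit extreme quantization as one component of the compression pipeline. The paper's exact scheme is not detailed here; as a minimal illustrative sketch (symmetric per-tensor quantization, with the function name, scale choice, and clipping range being assumptions, not the authors' implementation):

```python
import numpy as np

def quantize_4bit(w):
    """Symmetric per-tensor 4-bit quantization (illustrative sketch only).

    Maps weights to 16 integer levels in [-8, 7] using a single scale
    factor; dequantizing (q * scale) approximates the original weights.
    """
    scale = np.max(np.abs(w)) / 7.0          # largest weight maps to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 4))                   # toy weight matrix
q, scale = quantize_4bit(w)
w_hat = q.astype(float) * scale               # dequantized approximation
err = np.max(np.abs(w - w_hat))               # bounded by scale / 2
```

Storing 4-bit integers plus one scale per tensor is what yields the memory reduction; rounding error per weight is bounded by half the quantization step.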

📝 Abstract
Large Language Models (LLMs) present significant challenges for deployment in energy-constrained environments due to their large model sizes and high inference latency. Spiking Neural Networks (SNNs), inspired by the sparse event-driven neural processing and energy-efficient information transmission in the brain, offer a promising alternative for achieving low-power computing. Integrating the event-driven efficiency of spiking neurons with the advanced capabilities of LLMs represents a promising direction for power-efficient LLMs. This work specifically delves into the design of compressed spiking LLMs. Here, we revisit spatial and temporal pruning from the perspective of SNNs and propose a novel spatio-temporal pruning framework for Spiking LLMs to optimize computational efficiency while preserving high performance. Our spatial pruning technique reduces the number of active neurons and attention heads, effectively lowering the computational complexity of the model. Meanwhile, temporal pruning minimizes inference latency by dynamically adjusting the number of timesteps required for different layers. By combining these approaches with other compression techniques, we present the first work in the domain of Spiking LLMs to jointly explore spatial pruning, temporal pruning, extreme quantization and knowledge distillation strategies. Extensive experimental evaluation of our proposed framework for SpikingBERT on the large-scale GLUE benchmark demonstrates the efficacy of our approach in terms of computational operations and inference latency. Our approach offers a compelling solution for real-time, low-power natural language processing applications, making Spiking LLMs more practical for deployment on edge devices and in power-constrained settings.
Problem

Research questions and friction points this paper is trying to address.

Reducing energy consumption in large language models
Optimizing computational efficiency of spiking neural networks
Minimizing inference latency while preserving performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial pruning reduces active neurons and attention heads
Temporal pruning dynamically adjusts inference timesteps per layer
Combines extreme quantization and knowledge distillation strategies
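The two pruning ideas above can be sketched together on a toy integrate-and-fire layer. This is a minimal illustrative sketch, not the paper's method: the importance score, `keep_ratio`, convergence tolerance, and early-stopping rule are all assumptions standing in for the framework's learned criteria.

```python
import numpy as np

def spatial_prune(weights, keep_ratio=0.5):
    """Keep only the highest-magnitude output neurons (assumed criterion).

    Returns a boolean mask; pruned neurons never integrate or fire,
    so their computation can be skipped entirely.
    """
    scores = np.abs(weights).sum(axis=1)       # per-neuron importance proxy
    k = max(1, int(keep_ratio * len(scores)))
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

def run_layer(weights, x, mask, t_max=8, tol=1e-3, v_th=1.0):
    """Integrate-and-fire layer with temporal pruning via early stopping.

    Maintains a running spike-rate estimate and stops once it changes by
    less than `tol` between timesteps, instead of always running `t_max`
    steps (a simple stand-in for dynamically adjusting timesteps per layer).
    """
    v = np.zeros(weights.shape[0])             # membrane potentials
    rate = np.zeros_like(v)
    prev = rate.copy()
    for t in range(1, t_max + 1):
        v[mask] += weights[mask] @ x           # spatial pruning: skip masked-out neurons
        spikes = (v >= v_th).astype(float)
        v[spikes > 0] -= v_th                  # soft reset after firing
        rate = rate + (spikes - rate) / t      # running mean spike rate
        if t > 1 and np.max(np.abs(rate - prev)) < tol:
            return rate, t                     # temporal pruning: stop early
        prev = rate.copy()
    return rate, t_max

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
x = rng.random(32)
mask = spatial_prune(W, keep_ratio=0.5)
rates, steps_used = run_layer(W, x, mask)      # steps_used <= t_max
```

Savings come from both axes at once: the masked neurons drop out of every matrix-vector product, and converged layers stop consuming timesteps.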
Yi Jiang
School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA
Malyaban Bal
School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA
Brian Matejek
SRI International, Arlington, USA
Susmit Jha
Director, Neurosymbolic Computing and Intelligence, SRI International
Artificial Intelligence, Autonomy, Formal Methods, Machine Learning
Adam Cobb
SRI International, Arlington, USA
Abhronil Sengupta
Monkowski Career Development Associate Professor of EECS, Penn State University
Neuromorphic Computing