Spatio-Temporal Pruning for Compressed Spiking Large Language Models

📅 2025-08-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of energy constraints on edge devices, high inference latency, and model bloat in large language models (LLMs), this work proposes the first spatio-temporal pruning framework for Spiking Large Language Models (S-LLMs). Methodologically, it introduces a unified compression pipeline integrating spatial pruning (reducing active neurons and attention heads), temporal pruning (dynamically skipping redundant inference steps), 4-bit extreme quantization, and knowledge distillation—jointly optimized to preserve semantic representation capability. Evaluated on SpikingBERT and the GLUE benchmark, the framework achieves a 62.3% reduction in computational cost, a 57.1% decrease in inference latency, a 3.8× improvement in energy efficiency, and less than 1.2% accuracy degradation. This work establishes a scalable, end-to-end optimization paradigm for deploying S-LLMs in low-power, real-time natural language processing applications.
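The summary mentions 4-bit extreme quantization as one component of the compression pipeline. The paper's exact scheme is not detailed here; as a minimal illustrative sketch (symmetric per-tensor quantization, with the function name, scale choice, and clipping range being assumptions, not the authors' implementation):

```python
import numpy as np

def quantize_4bit(w):
    """Symmetric per-tensor 4-bit quantization (illustrative sketch only).

    Maps weights to 16 integer levels in [-8, 7] using a single scale
    factor; dequantizing (q * scale) approximates the original weights.
    """
    scale = np.max(np.abs(w)) / 7.0          # largest weight maps to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.normal(size=(4, 4))                   # toy weight matrix
q, scale = quantize_4bit(w)
w_hat = q.astype(float) * scale               # dequantized approximation
err = np.max(np.abs(w - w_hat))               # bounded by scale / 2
```

Storing 4-bit integers plus one scale per tensor is what yields the memory reduction; rounding error per weight is bounded by half the quantization step.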

📝 Abstract
Large Language Models (LLMs) present significant challenges for deployment in energy-constrained environments due to their large model sizes and high inference latency. Spiking Neural Networks (SNNs), inspired by the sparse event-driven neural processing and energy-efficient information transmission in the brain, offer a promising alternative for achieving low-power computing. Integrating the event-driven efficiency of spiking neurons with the advanced capabilities of LLMs represents a promising direction for power-efficient LLMs. This work specifically delves into the design of compressed spiking LLMs. Here, we revisit spatial and temporal pruning from the perspective of SNNs and propose a novel spatio-temporal pruning framework for Spiking LLMs to optimize computational efficiency while preserving high performance. Our spatial pruning technique reduces the number of active neurons and attention heads, effectively lowering the computational complexity of the model. Meanwhile, temporal pruning minimizes inference latency by dynamically adjusting the number of timesteps required for different layers. By combining these approaches with other compression techniques, we present the first work in the domain of Spiking LLMs to jointly explore spatial pruning, temporal pruning, extreme quantization and knowledge distillation strategies. Extensive experimental evaluation of our proposed framework for SpikingBERT on the large-scale GLUE benchmark demonstrates the efficacy of our approach in terms of computational operations and inference latency. Our approach offers a compelling solution for real-time, low-power natural language processing applications, making Spiking LLMs more practical for deployment on edge devices and in power-constrained settings.
Problem

Research questions and friction points this paper is trying to address.

Reducing energy consumption in large language models
Optimizing computational efficiency of spiking neural networks
Minimizing inference latency while preserving performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial pruning reduces active neurons and attention heads
Temporal pruning dynamically adjusts inference timesteps per layer
Combines extreme quantization and knowledge distillation strategies
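The two pruning ideas above can be sketched together on a toy integrate-and-fire layer. This is a minimal illustrative sketch, not the paper's method: the importance score, `keep_ratio`, convergence tolerance, and early-stopping rule are all assumptions standing in for the framework's learned criteria.

```python
import numpy as np

def spatial_prune(weights, keep_ratio=0.5):
    """Keep only the highest-magnitude output neurons (assumed criterion).

    Returns a boolean mask; pruned neurons never integrate or fire,
    so their computation can be skipped entirely.
    """
    scores = np.abs(weights).sum(axis=1)       # per-neuron importance proxy
    k = max(1, int(keep_ratio * len(scores)))
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

def run_layer(weights, x, mask, t_max=8, tol=1e-3, v_th=1.0):
    """Integrate-and-fire layer with temporal pruning via early stopping.

    Maintains a running spike-rate estimate and stops once it changes by
    less than `tol` between timesteps, instead of always running `t_max`
    steps (a simple stand-in for dynamically adjusting timesteps per layer).
    """
    v = np.zeros(weights.shape[0])             # membrane potentials
    rate = np.zeros_like(v)
    prev = rate.copy()
    for t in range(1, t_max + 1):
        v[mask] += weights[mask] @ x           # spatial pruning: skip masked-out neurons
        spikes = (v >= v_th).astype(float)
        v[spikes > 0] -= v_th                  # soft reset after firing
        rate = rate + (spikes - rate) / t      # running mean spike rate
        if t > 1 and np.max(np.abs(rate - prev)) < tol:
            return rate, t                     # temporal pruning: stop early
        prev = rate.copy()
    return rate, t_max

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
x = rng.random(32)
mask = spatial_prune(W, keep_ratio=0.5)
rates, steps_used = run_layer(W, x, mask)      # steps_used <= t_max
```

Savings come from both axes at once: the masked neurons drop out of every matrix-vector product, and converged layers stop consuming timesteps.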
Yi Jiang
School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA
Malyaban Bal
School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA
Brian Matejek
SRI International, Arlington, USA
Susmit Jha
Director, Neurosymbolic Computing and Intelligence, SRI International
Artificial Intelligence, Autonomy, Formal Methods, Machine Learning
Adam Cobb
SRI International, Arlington, USA
Abhronil Sengupta
Monkowski Career Development Associate Professor of EECS, Penn State University
Neuromorphic Computing