SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing spiking neural networks (SNNs) for speech command recognition struggle to capture long-range dependencies and contextual information due to their limited temporal modeling capacity and binary spike representations. To address this, the authors propose SpikCommander, a fully spike-driven transformer architecture featuring two novel components: the multi-view spiking temporal-aware self-attention (MSTASA) module and the spiking contextual refinement channel MLP (SCR-MLP). MSTASA models complementary temporal dependencies through multi-view spike sequence encoding, while the SCR-MLP refines contextual features and integrates them channel-wise in a spike-native manner. By combining event-driven feature extraction with multi-view learning, SpikCommander achieves state-of-the-art performance on three major benchmarks (SHD, SSC, and GSC), outperforming prior SNN methods in accuracy with fewer parameters under comparable time steps. This work advances both the modeling capability and the practical applicability of SNNs for temporal speech tasks.

📝 Abstract
Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration. We evaluate our method on three benchmark datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC), and the Google Speech Commands V2 (GSC). Extensive experiments demonstrate that SpikCommander consistently outperforms state-of-the-art (SOTA) SNN approaches with fewer parameters under comparable time steps, highlighting its effectiveness and efficiency for robust speech command recognition.
Problem

Research questions and friction points this paper is trying to address.

Capturing temporal dependencies in speech using spiking neural networks
Overcoming binary spike limitations for contextual speech modeling
Enhancing energy-efficient speech recognition with transformer architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view spiking temporal-aware self-attention module
Fully spike-driven transformer architecture design
Spiking contextual refinement channel MLP integration
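The spike-driven attention idea listed above can be sketched in miniature. The code below is an illustrative toy, not the authors' implementation: the thresholding rule, weight shapes, and the coincidence-count attention (binary Q/K/V, no softmax) are all assumptions chosen to show why fully spiking attention can avoid floating-point multiplications, and training-time surrogate gradients are omitted entirely.

```python
import numpy as np

def lif_spike(x, threshold=1.0):
    """Heaviside firing rule: emit a spike (1) where input crosses threshold.
    (A surrogate gradient would be needed for training; omitted here.)"""
    return (x >= threshold).astype(np.float32)

def spike_attention(x, Wq, Wk, Wv, threshold=1.0):
    """Toy spike-driven self-attention for a single timestep.

    x: (seq_len, d) binary spike inputs.
    Q, K, V are binarized with a LIF-style threshold, so the attention
    map q @ k.T is computed from 0/1 values only: each entry just counts
    coincident spikes between two positions (addition-only in principle).
    """
    q = lif_spike(x @ Wq, threshold)
    k = lif_spike(x @ Wk, threshold)
    v = lif_spike(x @ Wv, threshold)
    attn = q @ k.T                 # integer coincidence counts, no softmax
    out = attn @ v                 # accumulate spikes weighted by counts
    return lif_spike(out / x.shape[1], threshold)  # re-spike the output

rng = np.random.default_rng(0)
T, L, d = 4, 6, 8                  # timesteps, sequence length, channels
Ws = [rng.standard_normal((d, d)) * 0.5 for _ in range(3)]
spikes_in = (rng.random((T, L, d)) < 0.3).astype(np.float32)
# Process each timestep independently, as in an event-driven pipeline.
outs = np.stack([spike_attention(s, *Ws) for s in spikes_in])
print(outs.shape)
```

The per-timestep loop stands in for the temporal dimension over which MSTASA's multi-view encoding would operate; in the actual model, membrane state would carry information across timesteps rather than resetting each step.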
Jiaqi Wang
Harbin Institute of Technology, Shenzhen
Liutao Yu
Pengcheng Laboratory
Xiongri Shen
Harbin Institute of Technology, Shenzhen
Sihang Guo
Harbin Institute of Technology, Shenzhen
Chenlin Zhou
Peking University & Pengcheng Laboratory
Efficient Artificial Intelligence · Brain-inspired Computing
Leilei Zhao
Harbin Institute of Technology, Shenzhen
Yi Zhong
Harbin Institute of Technology, Shenzhen
Zhiguo Zhang
Harbin Institute of Technology, Shenzhen
Zhengyu Ma
Pengcheng Laboratory
Neuroscience · Neural Network Dynamics · Computational Physics