Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning

📅 2025-02-18
🤖 AI Summary
Existing spiking neural networks (SNNs) are predominantly unimodal, limiting effective audio-visual cross-modal representation learning. To address this, we propose the first brain-inspired multimodal SNN framework for audio-visual fusion, built upon a spiking Transformer architecture. Our method introduces four key innovations: (1) spatiotemporal spiking attention, (2) cross-modal residual connections, (3) shared semantic space projection, and (4) contrastive-driven semantic alignment optimization. By performing semantic alignment and residual cross-modal interaction directly in the spike domain, our approach significantly enhances feature consistency and complementarity. We evaluate on three benchmarks—CREMA-D, UrbanSound8K-AV, and MNISTDVS-NTIDIGITS—achieving state-of-the-art performance across all, substantially outperforming existing audio-visual SNN methods. The implementation is publicly available.

📝 Abstract
Humans interpret and perceive the world by integrating sensory information from multiple modalities, such as vision and hearing. Spiking Neural Networks (SNNs), as brain-inspired computational models, exhibit unique advantages in emulating the brain's information processing mechanisms. However, existing SNN models primarily focus on unimodal processing and lack efficient cross-modal information fusion, limiting their effectiveness in real-world multimodal scenarios. To address this challenge, we propose a semantic-alignment cross-modal residual learning (S-CMRL) framework, a Transformer-based multimodal SNN architecture designed for effective audio-visual integration. S-CMRL leverages a spatiotemporal spiking attention mechanism to extract complementary features across modalities, and incorporates a cross-modal residual learning strategy to enhance feature integration. Additionally, a semantic alignment optimization mechanism is introduced to align cross-modal features within a shared semantic space, improving their consistency and complementarity. Extensive experiments on three benchmark datasets (CREMA-D, UrbanSound8K-AV, and MNISTDVS-NTIDIGITS) demonstrate that S-CMRL significantly outperforms existing multimodal SNN methods, achieving state-of-the-art performance. The code is publicly available at https://github.com/Brain-Cog-Lab/S-CMRL.
Problem

Research questions and friction points this paper is trying to address.

Enhance cross-modal information fusion in SNNs
Improve audio-visual integration through semantic alignment
Achieve state-of-the-art performance in multimodal SNNs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based multimodal SNN architecture
Spatiotemporal spiking attention mechanism
Semantic alignment optimization mechanism
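The components listed above can be illustrated with a minimal NumPy sketch of the cross-modal residual connection, the shared semantic-space projection, and the contrastive alignment objective. The spiking attention mechanism is omitted, and all shapes, the Heaviside spike function, and parameter names are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch only: assumed shapes, thresholds, and parameter
# names; not the S-CMRL implementation from the paper's repository.
import numpy as np

rng = np.random.default_rng(0)

def heaviside(x, threshold=0.5):
    """Binarize membrane potentials into 0/1 spikes."""
    return (x > threshold).astype(np.float32)

def cross_modal_residual(primary, auxiliary, W):
    """Fuse auxiliary-modality spikes into the primary stream
    via a residual connection: y = primary + f(auxiliary)."""
    return primary + heaviside(auxiliary @ W)

def project_to_semantic_space(spikes, P):
    """Average spikes over time, project into a shared semantic
    space, and L2-normalize for contrastive alignment."""
    z = spikes.mean(axis=0) @ P
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-8)

def contrastive_alignment_loss(z_v, z_a, tau=0.1):
    """Symmetric InfoNCE-style loss pulling matched audio-visual
    pairs together in the shared semantic space."""
    logits = z_v @ z_a.T / tau                       # (B, B) similarities
    labels = np.arange(len(z_v))
    log_sm = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    loss_v2a = -log_sm[labels, labels].mean()
    log_sm_t = logits.T - np.log(np.exp(logits.T).sum(1, keepdims=True))
    loss_a2v = -log_sm_t[labels, labels].mean()
    return 0.5 * (loss_v2a + loss_a2v)

T, B, D = 4, 8, 16                       # time steps, batch, feature dim
vis = heaviside(rng.random((T, B, D)))   # visual spike trains
aud = heaviside(rng.random((T, B, D)))   # audio spike trains

W = rng.standard_normal((D, D)) * 0.1
fused = cross_modal_residual(vis, aud, W)

P = rng.standard_normal((D, D)) * 0.1
z_v = project_to_semantic_space(fused, P)
z_a = project_to_semantic_space(aud, P)
loss = contrastive_alignment_loss(z_v, z_a)
print(round(float(loss), 3))
```

In training, this contrastive term would be combined with the task loss so that the shared semantic space both aligns the modalities and preserves class-discriminative structure.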
Xiang He
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Dongcheng Zhao
Beijing Institute of AI Safety and Governance
Spiking Neural Networks, Event-Based Vision, Brain-inspired AI, LLM Safety
Yiting Dong
Peking University, Institute of Automation, CAS
Brain-inspired Intelligence, Spiking Neural Networks, Event-based Vision, Large Language Models
Guobin Shen
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
Xin Yang
CAS Key Laboratory of Molecular Imaging, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Yi Zeng
Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and Center for Long-term Artificial Intelligence, Beijing 100190, China, and University of Chinese Academy of Sciences, Beijing 100049, China, and Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Chinese Academy of Sciences, Shanghai, 200031, China