🤖 AI Summary
Existing spiking neural networks (SNNs) are predominantly unimodal, limiting effective audio-visual cross-modal representation learning. To address this, we propose the first brain-inspired multimodal SNN framework for audio-visual fusion, built upon a spiking Transformer architecture. Our method introduces four key innovations: (1) spatiotemporal spiking attention, (2) cross-modal residual connections, (3) shared semantic space projection, and (4) contrastive-driven semantic alignment optimization. By performing semantic alignment and residual cross-modal interaction directly in the spike domain, our approach significantly enhances feature consistency and complementarity. We evaluate on three benchmarks—CREMA-D, UrbanSound8K-AV, and MNISTDVS-NTIDIGITS—achieving state-of-the-art performance across all, substantially outperforming existing audio-visual SNN methods. The implementation is publicly available.
📝 Abstract
Humans interpret and perceive the world by integrating sensory information from multiple modalities, such as vision and hearing. Spiking Neural Networks (SNNs), as brain-inspired computational models, exhibit unique advantages in emulating the brain's information processing mechanisms. However, existing SNN models primarily focus on unimodal processing and lack efficient cross-modal information fusion, thereby limiting their effectiveness in real-world multimodal scenarios. To address this challenge, we propose semantic-alignment cross-modal residual learning (S-CMRL), a Transformer-based multimodal SNN framework designed for effective audio-visual integration. S-CMRL leverages a spatiotemporal spiking attention mechanism to extract complementary features across modalities, and incorporates a cross-modal residual learning strategy to enhance feature integration. Additionally, a semantic alignment optimization mechanism aligns cross-modal features within a shared semantic space, improving their consistency and complementarity. Extensive experiments on three benchmark datasets (CREMA-D, UrbanSound8K-AV, and MNISTDVS-NTIDIGITS) demonstrate that S-CMRL significantly outperforms existing multimodal SNN methods, achieving state-of-the-art performance. The code is publicly available at https://github.com/Brain-Cog-Lab/S-CMRL.
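The core fusion idea can be sketched in a few lines: one modality's spike features act as attention queries over the other modality's spike features, and the attended cross-modal context is added back through a residual connection. The sketch below is a minimal, illustrative NumPy version under assumed shapes and random weights; it is not the paper's implementation (which operates in the spike domain over time steps with trained spiking-Transformer weights), and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def heaviside(x, threshold=0.5):
    # Binarize activations into 0/1 spikes (stand-in for a spiking neuron).
    return (x >= threshold).astype(np.float64)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_residual(q_spikes, kv_spikes, d):
    """Attend from one modality's spikes to the other's, then add the
    attended context back to the query modality (residual fusion)."""
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)  # random stand-in weights
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    Q, K, V = q_spikes @ Wq, kv_spikes @ Wk, kv_spikes @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))           # cross-modal attention map
    return q_spikes + attn @ V                     # residual connection

N, d = 6, 8                                        # tokens, feature dim (illustrative)
vis = heaviside(rng.standard_normal((N, d)))       # visual spike features
aud = heaviside(rng.standard_normal((N, d)))       # audio spike features
fused = cross_modal_residual(vis, aud, d)
print(fused.shape)                                 # (6, 8)
```

Because the residual term preserves the query modality's own spikes, the fused representation keeps unimodal information even when the cross-modal attention contributes little, which is the usual motivation for residual-style fusion.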