Brain-Inspired Multimodal Spiking Neural Network for Image-Text Retrieval

📅 2026-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing image-text retrieval methods, which struggle to balance high performance with low power consumption due to inefficiencies in cross-modal interaction, latency, and energy usage. To overcome these challenges, the authors propose a brain-inspired Cross-Modal Spike Fusion network (CMSF), which, for the first time, directly applies Spiking Neural Networks (SNNs) to image-text retrieval. CMSF introduces a spike-level cross-modal fusion mechanism that efficiently integrates unimodal features and enhances multimodal representations within only two time steps. This approach significantly reduces both energy consumption and inference latency while achieving retrieval accuracy that surpasses state-of-the-art artificial neural network (ANN) methods, thereby advancing the practical deployment of SNNs in multimodal retrieval tasks.
📝 Abstract
Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite requiring only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while maintaining exceptionally low energy consumption and high retrieval speed. This work marks a significant step toward multimodal SNNs, offering a brain-inspired framework that unifies temporal dynamics with cross-modal alignment and provides new insights for future spiking-based multimodal research. The code is available at https://github.com/zxt6174/CMSF.
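The abstract describes fusing unimodal features at the spike level over two time steps, with the fused representation acting as a soft supervisory signal for the unimodal spike embeddings. The paper's actual architecture is not reproduced here; the following is a minimal NumPy sketch under assumed simplifications (a basic leaky integrate-and-fire rate code, element-wise OR as the spike-level fusion, and a cosine-distance soft-supervision loss, all hypothetical stand-ins):

```python
import numpy as np

def lif_spikes(x, T=2, v_th=1.0):
    """Rate-code a real-valued feature vector into binary spike trains
    over T time steps with a simple leaky integrate-and-fire neuron."""
    v = np.zeros_like(x)
    spikes = []
    for _ in range(T):
        v = 0.5 * v + x                   # leaky integration (decay 0.5, assumed)
        s = (v >= v_th).astype(float)     # fire where membrane crosses threshold
        v = v - s * v_th                  # soft reset on firing
        spikes.append(s)
    return np.stack(spikes)               # shape (T, D)

def spike_fusion(img_feat, txt_feat, T=2):
    """Fuse image and text features at the spike level: element-wise OR of
    the two spike trains, averaged over time into a rate-coded embedding."""
    s_img = lif_spikes(img_feat, T)
    s_txt = lif_spikes(txt_feat, T)
    fused = np.maximum(s_img, s_txt)      # spike-level OR fusion (assumed)
    return fused.mean(axis=0)             # multimodal rate embedding, shape (D,)

def soft_supervision_loss(uni_embed, fused_embed):
    """Cosine-distance loss pulling a unimodal embedding toward the fused
    multimodal representation, i.e. the 'soft supervisory signal'."""
    num = float(uni_embed @ fused_embed)
    den = np.linalg.norm(uni_embed) * np.linalg.norm(fused_embed) + 1e-8
    return 1.0 - num / den

rng = np.random.default_rng(0)
img = rng.uniform(0, 2, 8)                # toy unimodal image features
txt = rng.uniform(0, 2, 8)                # toy unimodal text features
fused = spike_fusion(img, txt, T=2)
loss = soft_supervision_loss(img, fused)
```

Even in this toy form, the two-time-step budget shows why the fused embedding is coarse (rates take values in {0, 0.5, 1}), which motivates the paper's use of the fused signal as supervision rather than as the retrieval embedding itself.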
Problem

Research questions and friction points this paper is trying to address.

Spiking Neural Networks
Image-Text Retrieval
Multimodal Learning
Energy Efficiency
Cross-Modal Interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spiking Neural Networks
Multimodal Fusion
Image-Text Retrieval
Brain-Inspired Computing
Low-Energy AI
Xintao Zong
Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology
Xian Zhong
Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology
Wenxuan Liu
State Key Laboratory for Multimedia Information Processing, Peking University
Jianhao Ding
Peking University
Spiking Neural Networks, Optimization, Neuromorphic Vision, Neural Coding
Zhaofei Yu
Peking University
Brain-Inspired Computing, Spiking Neural Networks, Computational Neuroscience
Tiejun Huang
Professor, School of Computer Science, Peking University
Visual Information Processing