🤖 AI Summary
This work addresses the limitations of existing image-text retrieval methods, which struggle to balance high performance with low power consumption due to inefficiencies in cross-modal interaction, latency, and energy usage. To overcome these challenges, the authors propose a brain-inspired Cross-Modal Spike Fusion network (CMSF), which, for the first time, directly applies Spiking Neural Networks (SNNs) to image-text retrieval. CMSF introduces a spike-level cross-modal fusion mechanism that efficiently integrates unimodal features and enhances multimodal representations within only two time steps. This approach significantly reduces both energy consumption and inference latency while achieving retrieval accuracy that surpasses state-of-the-art artificial neural network (ANN) methods, advancing the practical deployment of SNNs in multimodal retrieval tasks.
📝 Abstract
Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite requiring only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while maintaining exceptionally low energy consumption and high retrieval speed. This work marks a significant step toward multimodal SNNs, offering a brain-inspired framework that unifies temporal dynamics with cross-modal alignment and provides new insights for future spiking-based multimodal research. The code is available at https://github.com/zxt6174/CMSF.
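The abstract does not spell out the fusion mechanism, so the following is only a minimal illustrative sketch of the general idea of spike-level fusion over two time steps: real-valued unimodal features are encoded into binary spike trains by a simple leaky integrate-and-fire (LIF) neuron, the two trains are fused at the spike level (here, a logical OR, which is an assumption, not the paper's actual operator), and the fused firing rate serves as a soft target that unimodal embeddings could be regularized toward. All function names and parameters are hypothetical.

```python
import numpy as np

def lif_spikes(x, T=2, threshold=1.0, decay=0.5):
    """Encode real-valued features as binary spike trains over T time steps
    with a simple leaky integrate-and-fire neuron (hard reset on spike).
    Hypothetical stand-in for the unimodal spiking encoders."""
    v = np.zeros_like(x)
    spikes = []
    for _ in range(T):
        v = decay * v + x                      # leaky integration of the input
        s = (v >= threshold).astype(x.dtype)   # emit a binary spike at threshold
        v = v * (1.0 - s)                      # hard reset where a spike fired
        spikes.append(s)
    return np.stack(spikes)                    # shape: (T, feature_dim)

def spike_fusion(img_spikes, txt_spikes):
    """Assumed fusion rule: a spike in either modality contributes to the
    fused multimodal spike train (element-wise logical OR)."""
    return np.maximum(img_spikes, txt_spikes)

# toy unimodal features standing in for image/text encoder outputs
img_feat = np.array([0.4, 1.2, 0.8, 0.1])
txt_feat = np.array([1.1, 0.3, 0.9, 0.2])

T = 2  # CMSF reports using only two time steps
img_s = lif_spikes(img_feat, T=T)
txt_s = lif_spikes(txt_feat, T=T)
fused = spike_fusion(img_s, txt_s)

# the fused firing rate over the two time steps plays the role of the
# soft supervisory signal that refines the unimodal spike embeddings
soft_target = fused.mean(axis=0)
print(soft_target)  # → [1.  1.  0.5 0. ]
```

The two-time-step budget is why such a scheme can be fast: only two binary passes per modality are needed before fusion, rather than the long spike trains typical of rate-coded SNNs.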