Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-visual fusion methods struggle to balance cross-modal dependency modeling against computational efficiency, which limits the scalability of multi-scale architectures. To address this, the work proposes SNNergy, a framework that, for the first time, achieves hierarchical multi-scale cross-modal fusion with linear complexity. At its core is the CMQKA mechanism, which uses event-driven binary spiking operations to build an efficient bidirectional Query-Key attention and adds a learnable residual fusion strategy. Evaluated on the CREMA-D, AVE, and UrbanSound8K-AV benchmarks, the method attains state-of-the-art performance, outperforming existing approaches while being markedly more energy-efficient.
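
To make the linear-complexity claim concrete, below is a minimal PyTorch sketch of what such a mechanism could look like; it is an illustration under stated assumptions, not the paper's implementation. The class and gate names (CrossModalQKAttention, alpha_a, alpha_v), the surrogate-gradient spike function, and the single time step are all assumptions. The linear cost comes from evaluating K^T V (a dim x dim product) before applying Q, so the N x N score matrix of conventional attention never appears.

```python
import torch
import torch.nn as nn


class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a rectangular surrogate gradient, a standard SNN device."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x > 0).float()  # binary spikes: {0, 1}

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() < 0.5).float()  # pass gradient only near threshold


class CrossModalQKAttention(nn.Module):
    """Hypothetical bidirectional binary Query-Key attention with residual fusion.

    Computing K^T V first yields a dim x dim matrix, so the overall cost is
    O(N * dim^2): linear in the token count N. Because Q, K, V are binary
    spike tensors, the matrix products reduce to accumulate-only operations
    on event-driven hardware.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_a = nn.Linear(dim, dim, bias=False)  # audio-side projections
        self.k_a = nn.Linear(dim, dim, bias=False)
        self.v_a = nn.Linear(dim, dim, bias=False)
        self.q_v = nn.Linear(dim, dim, bias=False)  # visual-side projections
        self.k_v = nn.Linear(dim, dim, bias=False)
        self.v_v = nn.Linear(dim, dim, bias=False)
        # learnable residual-fusion gates; zero init leaves each modality unchanged at start
        self.alpha_a = nn.Parameter(torch.zeros(1))
        self.alpha_v = nn.Parameter(torch.zeros(1))

    def forward(self, a: torch.Tensor, v: torch.Tensor):
        # a: (B, N_a, dim) audio tokens; v: (B, N_v, dim) visual tokens
        spike = SpikeFn.apply
        qa, ka, va = spike(self.q_a(a)), spike(self.k_a(a)), spike(self.v_a(a))
        qv, kv, vv = spike(self.q_v(v)), spike(self.k_v(v)), spike(self.v_v(v))
        # bidirectional: audio queries read visual content, and vice versa
        a_ctx = qa @ (kv.transpose(-2, -1) @ vv) / v.shape[1]
        v_ctx = qv @ (ka.transpose(-2, -1) @ va) / a.shape[1]
        # residual fusion: keep modality-specific features, add cross-modal context
        return a + self.alpha_a * a_ctx, v + self.alpha_v * v_ctx
```

With a of shape (2, 200, 64) and v of shape (2, 50, 64), the module returns tensors of those same shapes; doubling either sequence length doubles the cost rather than quadrupling it, which is what makes stacking such blocks at many scales feasible.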

📝 Abstract
Effective multimodal fusion requires mechanisms that can capture complex cross-modal dependencies while remaining computationally scalable for real-world deployment. Existing audio-visual fusion approaches face a fundamental trade-off: attention-based methods effectively model cross-modal relationships but incur quadratic computational complexity that prevents hierarchical, multi-scale architectures, while efficient fusion strategies rely on simplistic concatenation that fails to extract complementary cross-modal information. We introduce CMQKA, a novel cross-modal fusion mechanism that achieves linear O(N) complexity through efficient binary operations, enabling scalable hierarchical fusion previously infeasible with conventional attention. CMQKA employs bidirectional cross-modal Query-Key attention to extract complementary spatiotemporal features and uses learnable residual fusion to preserve modality-specific characteristics while enriching representations with cross-modal information. Building upon CMQKA, we present SNNergy, an energy-efficient multimodal fusion framework with a hierarchical architecture that processes inputs through progressively decreasing spatial resolutions and increasing semantic abstraction. This multi-scale fusion capability allows the framework to capture both local patterns and global context across modalities. Implemented with event-driven binary spike operations, SNNergy achieves remarkable energy efficiency while maintaining fusion effectiveness and establishing new state-of-the-art results on challenging audio-visual benchmarks, including CREMA-D, AVE, and UrbanSound8K-AV, significantly outperforming existing multimodal fusion baselines. Our framework advances multimodal fusion by introducing a scalable fusion mechanism that enables hierarchical cross-modal integration with practical energy efficiency for real-world audio-visual intelligence systems.
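
The hierarchical, multi-scale design described in the abstract can be sketched as follows; this is a hypothetical wiring, not the published architecture. The stage widths, the strided-Conv1d downsampling, the input feature sizes, the simplified ResidualFuse stand-in (where the paper uses CMQKA), and the six-way head (matching CREMA-D's six emotion categories) are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class ResidualFuse(nn.Module):
    """Simplified stand-in for a CMQKA-style block: a learnable residual mix
    of mean-pooled cross-modal context (illustration only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, a, v):
        a = a + self.alpha * self.proj(v.mean(dim=1, keepdim=True))  # visual -> audio
        v = v + self.alpha * self.proj(a.mean(dim=1, keepdim=True))  # audio -> visual
        return a, v


class MultiScaleFusionNet(nn.Module):
    """Hypothetical hierarchical pipeline: each stage halves the sequence
    length (strided Conv1d) and widens the channels, fusing modalities at
    every scale, so early stages exchange local detail and late stages
    exchange global semantic context."""

    def __init__(self, a_feat=128, v_feat=512, dims=(64, 128, 256), num_classes=6):
        super().__init__()
        self.a_in = nn.Linear(a_feat, dims[0])  # e.g. log-mel frames
        self.v_in = nn.Linear(v_feat, dims[0])  # e.g. per-frame visual embeddings
        self.a_down, self.v_down, self.fuse = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        prev = dims[0]
        for d in dims:
            self.a_down.append(nn.Conv1d(prev, d, kernel_size=3, stride=2, padding=1))
            self.v_down.append(nn.Conv1d(prev, d, kernel_size=3, stride=2, padding=1))
            self.fuse.append(ResidualFuse(d))
            prev = d
        self.head = nn.Linear(2 * dims[-1], num_classes)

    def forward(self, a, v):
        # a: (B, T_a, a_feat) audio frames; v: (B, T_v, v_feat) video frames
        a, v = self.a_in(a), self.v_in(v)
        for down_a, down_v, fuse in zip(self.a_down, self.v_down, self.fuse):
            a = down_a(a.transpose(1, 2)).transpose(1, 2)  # halve temporal resolution
            v = down_v(v.transpose(1, 2)).transpose(1, 2)
            a, v = fuse(a, v)                              # cross-modal fusion at this scale
        return self.head(torch.cat([a.mean(dim=1), v.mean(dim=1)], dim=-1))
```

Fusing at every stage, rather than once at the end, is exactly the pattern that quadratic attention makes prohibitive and a linear-complexity mechanism makes affordable: with fixed per-stage width, total fusion cost stays proportional to the input length.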
Problem

Research questions and friction points this paper is trying to address.

cross-modal fusion
computational complexity
audio-visual learning
multimodal integration
energy efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Binary Attention
Linear Complexity
Hierarchical Fusion
Energy-Efficient Multimodal Learning
Spiking Neural Networks
Mohamed Saleh
Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover, Germany
Zahra Ahmadi
Junior Group Leader, PLRI Medical Informatics Institute, Hannover Medical School
Human-centered AI · Multimodal Learning · Data Mining · Machine Learning