🤖 AI Summary
Existing audio-visual fusion methods struggle to balance cross-modal dependency modeling with computational efficiency, limiting the scalability of multi-scale architectures. To address this challenge, this work proposes SNNergy, a novel framework that, for the first time, achieves hierarchical multi-scale cross-modal fusion with linear complexity. At its core lies the CMQKA mechanism, which leverages event-driven binary spiking operations to construct an efficient bidirectional Query-Key attention mechanism and integrates a learnable residual fusion strategy. Evaluated on benchmark datasets including CREMA-D, AVE, and UrbanSound8K-AV, the proposed method attains state-of-the-art performance, significantly outperforming existing approaches while demonstrating exceptional energy efficiency.
📝 Abstract
Effective multimodal fusion requires mechanisms that can capture complex cross-modal dependencies while remaining computationally scalable for real-world deployment. Existing audio-visual fusion approaches face a fundamental trade-off: attention-based methods effectively model cross-modal relationships but incur quadratic computational complexity that prevents hierarchical, multi-scale architectures, while efficient fusion strategies rely on simplistic concatenation that fails to extract complementary cross-modal information. We introduce CMQKA, a novel cross-modal fusion mechanism that achieves linear O(N) complexity through efficient binary operations, enabling scalable hierarchical fusion that is infeasible with conventional attention. CMQKA employs bidirectional cross-modal Query-Key attention to extract complementary spatiotemporal features and uses learnable residual fusion to preserve modality-specific characteristics while enriching representations with cross-modal information. Building upon CMQKA, we present SNNergy, an energy-efficient multimodal fusion framework with a hierarchical architecture that processes inputs at progressively decreasing spatial resolutions and increasing levels of semantic abstraction. This multi-scale fusion capability allows the framework to capture both local patterns and global context across modalities. Implemented with event-driven binary spike operations, SNNergy achieves remarkable energy efficiency while maintaining fusion effectiveness, establishing new state-of-the-art results on challenging audio-visual benchmarks, including CREMA-D, AVE, and UrbanSound8K-AV, and significantly outperforming existing multimodal fusion baselines. Our framework advances multimodal fusion by introducing a scalable fusion mechanism that enables hierarchical cross-modal integration with practical energy efficiency for real-world audio-visual intelligence systems.
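The abstract does not include an implementation, but the core idea can be illustrated with a minimal NumPy sketch: binary spike activations, a Query-Key interaction whose cost is linear in the number of tokens (keys are reduced to a single channel-wise descriptor rather than forming an N×M attention matrix), applied in both directions, followed by a learnable residual fusion. All function names, thresholds, shapes, and the specific key-reduction scheme below are illustrative assumptions, not the authors' CMQKA implementation.

```python
import numpy as np

def spike(x, thresh=0.0):
    # Heaviside step: event-driven binary spikes (0/1), an assumed surrogate
    # for the paper's spiking neuron model
    return (x > thresh).astype(np.float32)

def qk_cross_attention(x_q, x_k, w_q, w_k, alpha):
    """One direction of a toy linear-complexity Q-K cross-modal attention.

    x_q: (N, d) tokens of the querying modality
    x_k: (M, d) tokens of the key modality
    Cost is O(N + M) in token count: keys are collapsed into one
    channel-wise binary mask before interacting with the queries,
    avoiding the O(N*M) pairwise attention matrix.
    """
    q = spike(x_q @ w_q)                               # (N, d) binary query spikes
    k = spike(x_k @ w_k)                               # (M, d) binary key spikes
    # Channel-wise key summary: one pass over the M key tokens -> (d,)
    k_summary = spike(k.sum(axis=0) - k.shape[0] / 2)  # binary channel mask
    attended = q * k_summary                           # (N, d), multiply-free in spirit
    # Learnable residual fusion preserves modality-specific features
    return x_q + alpha * attended

rng = np.random.default_rng(0)
d = 8
audio = rng.standard_normal((16, d)).astype(np.float32)  # 16 audio tokens
video = rng.standard_normal((32, d)).astype(np.float32)  # 32 video tokens
w_q = rng.standard_normal((d, d)).astype(np.float32)
w_k = rng.standard_normal((d, d)).astype(np.float32)

# Bidirectional: each modality queries the other, then residual-fuses
audio_fused = qk_cross_attention(audio, video, w_q, w_k, alpha=0.5)
video_fused = qk_cross_attention(video, audio, w_q, w_k, alpha=0.5)
print(audio_fused.shape, video_fused.shape)  # (16, 8) (32, 8)
```

In a hierarchical framework like the one described, a block of this form would be applied at each spatial scale, with `alpha` learned per stage; the binary Q-K products are what make the operations compatible with energy-efficient, event-driven spiking hardware.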