🤖 AI Summary
This work addresses the insufficient modeling of temporal dynamics and harmonic spectral structures in heart murmur classification (HMC). We propose a cross-modal learning framework that jointly leverages neural audio codec representations (NACRs, e.g., EnCodec) and handcrafted spectral features (SFs, e.g., MFCCs). Our key innovation is a bandit-based cross-attention mechanism, inspired by multi-armed bandits, which dynamically selects and reweights critical attention heads to suppress modality-specific noise while enhancing discriminative feature fusion. The method is fully end-to-end trainable without requiring auxiliary annotations. Evaluated on standard phonocardiogram datasets, it significantly outperforms unimodal baselines and conventional feature concatenation or weighted fusion approaches, establishing new state-of-the-art performance for HMC.
📝 Abstract
In this study, we focus on heart murmur classification (HMC) and hypothesize that combining neural audio codec representations (NACRs), such as EnCodec, with spectral features (SFs), such as MFCCs, will yield superior performance. We believe such fusion will trigger their complementary behavior: NACRs excel at capturing fine-grained acoustic patterns such as rhythm changes, while SFs capture frequency-domain properties such as harmonic structure and spectral energy distribution, which are crucial for analyzing the complex nature of heart sounds. To this end, we propose BAOMI, a novel framework built on a bandit-based cross-attention mechanism for effective fusion. Here, a bandit agent assigns greater weight to the most important heads in the multi-head cross-attention mechanism, helping to mitigate noise. With BAOMI, we report the best performance in comparison to individual NACRs, SFs, and baseline fusion techniques, setting a new state-of-the-art.
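To make the core idea concrete, below is a minimal, illustrative sketch of bandit-weighted multi-head cross-attention in NumPy. This is not the authors' implementation: the class name `HeadBandit`, the epsilon-greedy update, the reward signal, and all dimensions are assumptions chosen for illustration. Queries come from one modality (e.g., NACRs) and keys/values from the other (e.g., MFCCs); a simple bandit keeps a running reward estimate per head and converts it into soft fusion weights over head outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(q, kv, num_heads):
    """Toy multi-head cross-attention: queries from one modality,
    keys/values from the other. Returns per-head outputs with
    shape (num_heads, seq_q, d_head)."""
    seq_q, d_model = q.shape
    d_head = d_model // num_heads
    outs = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        qh, kh, vh = q[:, sl], kv[:, sl], kv[:, sl]
        attn = softmax(qh @ kh.T / np.sqrt(d_head), axis=-1)
        outs.append(attn @ vh)
    return np.stack(outs)

class HeadBandit:
    """Hypothetical epsilon-greedy bandit over attention heads:
    tracks a running reward estimate per head and turns the
    estimates into soft fusion weights."""
    def __init__(self, num_heads, eps=0.1):
        self.values = np.zeros(num_heads)
        self.counts = np.zeros(num_heads)
        self.eps = eps

    def head_weights(self):
        # Soft weights favoring heads with higher estimated reward.
        return softmax(self.values)

    def select(self, rng):
        # Explore a random head with prob. eps, else exploit the best.
        if rng.random() < self.eps:
            return int(rng.integers(len(self.values)))
        return int(np.argmax(self.values))

    def update(self, head, reward):
        # Incremental mean update of the chosen head's value.
        self.counts[head] += 1
        self.values[head] += (reward - self.values[head]) / self.counts[head]

# Example: fuse per-head outputs using the bandit's soft weights.
rng = np.random.default_rng(0)
q = rng.standard_normal((5, 16))    # 5 query frames, d_model=16
kv = rng.standard_normal((7, 16))   # 7 key/value frames
heads = multi_head_cross_attention(q, kv, num_heads=4)
bandit = HeadBandit(num_heads=4)
bandit.update(2, 1.0)               # e.g., reward = validation gain
fused = np.tensordot(bandit.head_weights(), heads, axes=1)
```

In practice the reward driving `update` would come from a task signal (e.g., classification improvement), so noisy heads decay in weight while discriminative heads dominate the fused representation.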