AI Summary
This work addresses the limitations of existing audio-text retrieval methods, which suffer from degraded performance on long-duration, noisy, and weakly labeled audio and exhibit instability under small-batch training. To overcome these challenges, the authors propose a cross-modal embedding refinement mechanism that integrates Transformer-based projection, linear mapping, and bidirectional attention. Additionally, they introduce a silence-aware chunking strategy coupled with attentive pooling to better capture relevant audio segments. A hybrid loss function combining cosine similarity, L1 regularization, and contrastive loss is designed to enhance model robustness and training stability. Experimental results demonstrate that the proposed approach significantly outperforms state-of-the-art methods on standard benchmarks, exhibiting particularly strong robustness in noisy conditions with signal-to-noise ratios between 5 and 15 dB.
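The silence-aware chunking and attentive pooling mentioned above can be illustrated with a minimal sketch. The summary does not give the authors' actual procedure, so everything here is an assumption made for illustration: the function names `silence_aware_chunks` and `attentive_pool`, the RMS-energy silence test, the `frame_len`, `silence_thresh`, and `min_silence_frames` parameters, and the use of a single query vector with dot-product scoring for the attention weights.

```python
import math

def silence_aware_chunks(samples, frame_len=4, silence_thresh=0.01, min_silence_frames=2):
    """Split a waveform at sustained low-energy regions (hypothetical helper).

    Returns a list of (start, end) sample ranges covering the voiced regions.
    """
    n_frames = len(samples) // frame_len
    # A frame counts as silent when its RMS energy falls below the threshold.
    silent = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        silent.append(rms < silence_thresh)

    chunks, start, last_voiced, run = [], None, None, 0
    for i, is_silent in enumerate(silent):
        if not is_silent:
            if start is None:
                start = i          # open a new chunk at the first voiced frame
            last_voiced = i
            run = 0
        elif start is not None:
            run += 1
            if run >= min_silence_frames:
                # Enough consecutive silence: close the chunk at the last voiced frame.
                chunks.append((start * frame_len, (last_voiced + 1) * frame_len))
                start, run = None, 0
    if start is not None:
        chunks.append((start * frame_len, (last_voiced + 1) * frame_len))
    return chunks

def attentive_pool(chunk_embs, query):
    """Softmax-weighted average of chunk embeddings (hypothetical helper).

    Scores are dot products with `query`, which stands in for a learned vector.
    Returns (pooled_embedding, attention_weights).
    """
    scores = [sum(a * b for a, b in zip(emb, query)) for emb in chunk_embs]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    pooled = [sum(w * emb[d] for w, emb in zip(weights, chunk_embs)) for d in range(dim)]
    return pooled, weights
```

For a toy signal of eight loud samples, eight zeros, and four quieter samples, `silence_aware_chunks` with the defaults returns the ranges `(0, 8)` and `(16, 20)`, dropping the silent gap. In a real system each range would be encoded separately and the resulting chunk embeddings passed to the pooling step.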
Abstract
Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $\mathcal{L}_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15 dB) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.
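One way to picture the hybrid loss is as a weighted sum of the three named objectives. The abstract specifies neither the weights nor the exact form of each term, so the sketch below is built on assumptions: the function name `hybrid_loss`, the weights `alpha`, `beta`, and `gamma`, the `temperature` value, the use of `1 - cos` for the cosine term, a mean absolute difference for the $\mathcal{L}_{1}$ term, and a symmetric InfoNCE-style cross-entropy over in-batch pairs for the contrastive term.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def _diag_cross_entropy(sim_rows):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(sim_rows):
        top = max(row)  # log-sum-exp with max subtraction for stability
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in row))
        total += log_sum_exp - row[i]
    return total / len(sim_rows)

def hybrid_loss(audio, text, alpha=1.0, beta=1.0, gamma=1.0, temperature=0.07):
    """Weighted sum of cosine, L1, and contrastive terms (illustrative only).

    `audio[i]` and `text[i]` are assumed to be a matched embedding pair.
    All weights and the temperature are assumed values, not the paper's.
    """
    n = len(audio)
    # Cosine term: pull matched pairs toward similarity 1.
    cos_term = sum(1.0 - cosine(a, t) for a, t in zip(audio, text)) / n
    # L1 term: mean absolute difference between matched embeddings.
    l1_term = sum(
        sum(abs(x - y) for x, y in zip(a, t)) / len(a) for a, t in zip(audio, text)
    ) / n
    # Contrastive term: symmetric InfoNCE-style loss over the in-batch similarity matrix.
    sim = [[cosine(a, t) / temperature for t in text] for a in audio]
    audio_to_text = _diag_cross_entropy(sim)
    text_to_audio = _diag_cross_entropy([list(col) for col in zip(*sim)])
    con_term = 0.5 * (audio_to_text + text_to_audio)
    return alpha * cos_term + beta * l1_term + gamma * con_term
```

The cosine and $\mathcal{L}_{1}$ terms depend only on matched pairs, so they still provide a training signal when the batch holds few negatives. That is one plausible reading of why such a blend could help under small-batch constraints, though the abstract does not spell out the mechanism.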