Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

📅 2026-04-25
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the limitations of existing audio–text retrieval methods, which degrade on long-duration, noisy, and weakly labeled audio and become unstable under small-batch training. The authors propose a cross-modal embedding refinement module that combines Transformer-based projection, linear mapping, and bidirectional attention. They also introduce silence-aware chunking coupled with attention-based pooling to handle long-form audio. A hybrid loss that blends cosine similarity, an $\mathcal{L}_{1}$ term, and a contrastive objective is designed to keep training stable under small batches. The approach is designed to handle noisy audio with signal-to-noise ratios between 5 and 15 dB, and experiments on benchmark datasets show improvements over prior methods.
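To make the hybrid objective more concrete, here is a minimal sketch, not the authors' implementation. It assumes paired audio and text embeddings, a symmetric InfoNCE-style contrastive term, and hypothetical weights `w_cos`, `w_l1`, `w_con` and temperature `tau`; the source specifies none of these.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def _diag_cross_entropy(sim, tau):
    # Mean cross-entropy where the correct "class" for row i is column i.
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)


def hybrid_loss(audio, text, w_cos=1.0, w_l1=1.0, w_con=1.0, tau=0.07):
    """Weighted sum of cosine, L1, and contrastive terms (weights are assumptions)."""
    a = [_normalize(v) for v in audio]
    t = [_normalize(v) for v in text]
    n = len(a)

    # Cosine term: 1 - cosine similarity of each matched pair.
    cos_term = sum(1.0 - _dot(x, y) for x, y in zip(a, t)) / n

    # L1 term: mean absolute difference between matched, normalized embeddings.
    l1_term = sum(
        sum(abs(p - q) for p, q in zip(x, y)) / len(x) for x, y in zip(a, t)
    ) / n

    # Contrastive term: audio-to-text and text-to-audio, averaged.
    sim = [[_dot(x, y) for y in t] for x in a]
    sim_t = [list(col) for col in zip(*sim)]
    con_term = 0.5 * (_diag_cross_entropy(sim, tau) + _diag_cross_entropy(sim_t, tau))

    return w_cos * cos_term + w_l1 * l1_term + w_con * con_term
```

In this sketch the cosine and L1 terms are computed per matched pair, so they do not depend on how many in-batch negatives are available; only the contrastive term does.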

📝 Abstract
Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $\mathcal{L}_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15 dB) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.
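The silence-aware chunking and attention-based pooling mentioned in the abstract could look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's method: silence is detected with a simple per-frame RMS threshold, and the names `frame_len`, `silence_rms`, `max_frames`, and `query` are hypothetical. In a trained model the pooling query would be learned rather than passed in.

```python
import math


def frame_rms(samples, frame_len):
    # Root-mean-square energy of consecutive, non-overlapping frames.
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [math.sqrt(sum(x * x for x in f) / len(f)) for f in frames]


def silence_aware_chunks(samples, frame_len=4, silence_rms=0.05, max_frames=8):
    """Return (start, end) sample ranges of non-silent chunks.

    A chunk is closed when a silent frame is reached or when it already
    spans `max_frames` frames (assumed cap for long-form audio).
    """
    chunks, start = [], None
    for idx, rms in enumerate(frame_rms(samples, frame_len)):
        voiced = rms >= silence_rms
        if voiced and start is None:
            start = idx
        if start is not None and (not voiced or idx - start >= max_frames):
            chunks.append((start * frame_len, idx * frame_len))
            start = idx if voiced else None
    if start is not None:
        chunks.append((start * frame_len, len(samples)))
    return chunks


def attentive_pool(chunk_embeddings, query):
    """Softmax-weighted average of chunk embeddings, scored against `query`."""
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in chunk_embeddings]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(chunk_embeddings[0])
    pooled = [
        sum(w * emb[d] for w, emb in zip(weights, chunk_embeddings))
        for d in range(dim)
    ]
    return pooled, weights
```

Each chunk would be embedded separately by the audio encoder, and `attentive_pool` would then combine the chunk embeddings into one clip-level vector, giving more weight to chunks that score highly against the query.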

Problem

Research questions and friction points this paper is trying to address.

audio-text retrieval
noisy audio
weakly labeled data
long-form audio
robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal attention
hybrid loss
audio-text retrieval
silence-aware chunking
embedding refinement
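
As an illustration of the cross-modal attention and embedding refinement ideas listed above, the following sketch lets each modality attend to the other and adds the result back to the original tokens. It is a simplification under assumptions: single-head scaled dot-product attention, no learned projections (the paper's Transformer-based projection and linear mapping are omitted), and a residual connection that the source does not confirm. The function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        )
        out.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(len(q))]
        )
    return out


def refine_bidirectional(audio_tokens, text_tokens):
    """Each modality attends to the other; context is added back residually."""
    audio_ctx = attend(audio_tokens, text_tokens)  # audio queries, text keys/values
    text_ctx = attend(text_tokens, audio_tokens)   # text queries, audio keys/values
    refined_audio = [
        [a + c for a, c in zip(tok, ctx)] for tok, ctx in zip(audio_tokens, audio_ctx)
    ]
    refined_text = [
        [t + c for t, c in zip(tok, ctx)] for tok, ctx in zip(text_tokens, text_ctx)
    ]
    return refined_audio, refined_text
```

Both outputs keep the shape of their inputs, so the refined tokens can be pooled and compared exactly as the unrefined ones would be.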