Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing audio-text retrieval methods: performance degrades on long, noisy, and weakly labeled audio, and training becomes unstable with small batches. The authors propose a cross-modal embedding refinement module that combines Transformer-based projection, linear mapping, and bidirectional attention. They also introduce a silence-aware chunking strategy with attentive pooling so that long recordings are represented by their relevant segments. A hybrid loss that blends cosine similarity, an L1 term, and a contrastive objective is designed to improve robustness and keep training stable under small-batch constraints. Experiments on benchmark datasets show improvements over prior methods, including on noisy audio with signal-to-noise ratios between 5 and 15 dB.
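The summary does not spell out how silence-aware chunking and attentive pooling are implemented, so the following is only a minimal NumPy sketch of one plausible reading. The energy threshold, the frame length, the chunk-length cap, and the use of a single query vector for pooling are all assumptions made for illustration, and the function names are hypothetical rather than taken from the paper.

```python
import numpy as np

def silence_aware_chunks(wave, frame_len=400, energy_thresh=1e-4, max_frames=50):
    """Split a 1-D waveform into non-silent chunks (assumed energy-based rule).

    Returns a list of (start_sample, end_sample) pairs. A chunk is closed
    when a silent frame is reached or when it spans max_frames frames.
    """
    n_frames = len(wave) // frame_len
    frames = wave[: n_frames * frame_len].reshape(n_frames, frame_len)
    # A frame counts as non-silent if its mean energy exceeds the threshold.
    active = (frames ** 2).mean(axis=1) > energy_thresh

    chunks, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i  # open a new chunk at the first active frame
        if start is not None and (not is_active or i - start + 1 >= max_frames):
            end = i if not is_active else i + 1
            chunks.append((start * frame_len, end * frame_len))
            start = None
    if start is not None:  # close a chunk that runs to the end of the signal
        chunks.append((start * frame_len, n_frames * frame_len))
    return chunks

def attentive_pool(chunk_embs, query):
    """Pool per-chunk embeddings (n_chunks, dim) into one vector.

    The query vector stands in for a learned parameter (an assumption);
    weights are a softmax over scaled dot-product scores.
    """
    scores = chunk_embs @ query / np.sqrt(chunk_embs.shape[1])
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights = weights / weights.sum()
    return weights @ chunk_embs, weights
```

In a full pipeline each chunk would be passed through an audio encoder before pooling; here the chunk embeddings are simply taken as given, so the sketch only shows the segmentation and weighting steps.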

📝 Abstract

Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio because they rely on contrastive learning and large-batch training. We propose a multimodal retrieval framework whose cross-modal embedding refinement module combines transformer-based projection, linear mapping, and bidirectional attention to refine audio and text embeddings. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, $\mathcal{L}_{1}$, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15 dB) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.
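The abstract names the three ingredients of the hybrid loss but not their exact form or weighting. As a rough illustration, the sketch below assumes a weighted sum of a cosine-distance term and an $\mathcal{L}_{1}$ term on matched pairs, plus a symmetric InfoNCE-style contrastive term over the batch. The weights `w_cos`, `w_l1`, `w_con`, the `temperature` value, and the function name are hypothetical choices, not values reported by the authors.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def hybrid_loss(audio, text, w_cos=1.0, w_l1=1.0, w_con=1.0, temperature=0.07):
    """Assumed hybrid loss for (batch, dim) embeddings; row i of each is a pair."""
    a, t = l2_normalize(audio), l2_normalize(text)

    # Cosine term: one minus the cosine similarity of each matched pair.
    cos_term = np.mean(1.0 - np.sum(a * t, axis=1))

    # L1 term: mean absolute difference between matched embeddings.
    l1_term = np.mean(np.abs(a - t))

    # Contrastive term: cross-entropy with the matched pair as the target,
    # averaged over the audio-to-text and text-to-audio directions.
    logits = a @ t.T / temperature

    def diag_cross_entropy(z):
        z = z - z.max(axis=1, keepdims=True)  # stabilise the softmax
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    con_term = 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))
    return w_cos * cos_term + w_l1 * l1_term + w_con * con_term
```

Under these assumptions, the cosine and $\mathcal{L}_{1}$ terms depend only on matched pairs, so they still provide a training signal when a small batch contains few negatives; this is one possible reading of why such a blend could help under small-batch constraints.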

Problem

Research questions and friction points this paper is trying to address.

audio-text retrieval

noisy audio

weakly labeled data

long-form audio

robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-modal attention

hybrid loss

audio-text retrieval