🤖 AI Summary
Existing audio-text retrieval (ATR) methods rely on single-level cross-modal interaction, limiting their ability to model fine-grained semantic alignment and producing globally dominant representations with only coarse local matching. To address this, we propose the Hierarchical Alignment and Disentangled Representation framework (HADE). Our approach features: (1) two-stream Transformers with a Hierarchical Alignment module (THA) that identifies multi-level audio–text correspondences across different Transformer blocks via layered cross-modal attention; and (2) a Disentangled Cross-modal Representation (DCR) approach coupled with a confidence-aware (CA) module, enabling interpretable, confidence-aware local semantic matching. THA captures alignments at varying levels of abstraction, DCR disentangles high-dimensional features into compact latent factors that capture fine-grained audio–text semantic correlations, and CA adaptively weights latent factor pairs by their estimated alignment confidence. Evaluated on AudioCaps and Clotho, HADE achieves state-of-the-art performance, improving Recall@1 by an average of 5.2% over prior methods.
📝 Abstract
Most existing audio-text retrieval (ATR) approaches rely on a single-level interaction to associate audio and text, limiting their ability to align the two modalities and leading to suboptimal matches. In this work, we present a novel ATR framework that leverages two-stream Transformers in conjunction with a Hierarchical Alignment (THA) module to identify multi-level correspondences between audio and text across different Transformer blocks. Moreover, current ATR methods mainly focus on learning a global-level representation, missing the intricate details needed to capture audio events that correspond to textual semantics. To bridge this gap, we introduce a Disentangled Cross-modal Representation (DCR) approach that disentangles high-dimensional features into compact latent factors to grasp fine-grained audio–text semantic correlations. Additionally, we develop a confidence-aware (CA) module to estimate the confidence of each latent factor pair and adaptively aggregate cross-modal latent factors to achieve local semantic alignment. Experiments show that our THA effectively boosts ATR performance, with the DCR approach further contributing to consistent performance gains.
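The two scoring ideas in the abstract can be sketched at a high level: THA aggregates similarities computed at each Transformer block, and the CA module weights per-latent-factor similarities by an estimated confidence. This is a minimal NumPy sketch, not the paper's implementation; the pooling, the number of latent factors `K`, and the use of a softmax over similarities as the confidence estimator are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def cosine(a, b):
    # Row-wise cosine similarity between matching rows of a and b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def hierarchical_similarity(audio_blocks, text_blocks, block_weights=None):
    """THA-style sketch: combine per-Transformer-block similarities.

    audio_blocks, text_blocks: lists of pooled (d,) features, one per block.
    block_weights is hypothetical; uniform weighting is used by default.
    """
    sims = np.array([cosine(a, t) for a, t in zip(audio_blocks, text_blocks)])
    if block_weights is None:
        block_weights = np.full(len(sims), 1.0 / len(sims))
    return float(np.dot(block_weights, sims))

def confidence_weighted_match(audio_factors, text_factors):
    """CA-style sketch: confidence-weighted aggregation over latent factors.

    audio_factors, text_factors: (K, d) arrays of K disentangled latent
    factors. Confidence is modeled here as a softmax over per-factor
    similarities; the paper's learned confidence estimator is not
    described in the abstract.
    """
    sims = cosine(audio_factors, text_factors)   # (K,) per-factor similarity
    conf = np.exp(sims) / np.sum(np.exp(sims))   # confidence weights, sum to 1
    return float(np.dot(conf, sims))             # aggregated local match score
```

With perfectly aligned inputs both scores reach 1.0; mismatched latent factors are down-weighted by their lower confidence, which is the intended effect of the CA module.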