🤖 AI Summary
Existing audio-text retrieval (ATR) methods rely on single-level cross-modal interaction, limiting their ability to model fine-grained semantic alignment and producing globally dominant representations with only coarse local matching. To address this, we propose the Hierarchical Alignment and Disentangled Representation framework (HADE). Our approach features: (1) two-stream Transformers with a Hierarchical Alignment module (THA) that identifies multi-level audio–text correspondences across different Transformer blocks via layered cross-modal attention; and (2) a Disentangled Cross-modal Representation (DCR) approach coupled with a confidence-aware (CA) module, enabling interpretable, confidence-aware local semantic matching. THA captures alignments at varying levels of abstraction, DCR disentangles high-dimensional features into compact latent factors that capture fine-grained audio–text semantic correlations, and CA adaptively weights latent factor pairs by their estimated alignment confidence. Evaluated on AudioCaps and Clotho, HADE achieves state-of-the-art performance, improving Recall@1 by an average of 5.2% over prior methods.
📝 Abstract
Most existing audio-text retrieval (ATR) approaches rely on a single-level interaction to associate audio and text, limiting their ability to align the two modalities and leading to suboptimal matches. In this work, we present a novel ATR framework that leverages two-stream Transformers in conjunction with a Hierarchical Alignment (THA) module to identify multi-level correspondences between audio and text across different Transformer blocks. Moreover, current ATR methods mainly focus on learning a global-level representation, missing the intricate details needed to capture audio events that correspond to textual semantics. To bridge this gap, we introduce a Disentangled Cross-modal Representation (DCR) approach that disentangles high-dimensional features into compact latent factors to grasp fine-grained audio–text semantic correlations. Additionally, we develop a confidence-aware (CA) module to estimate the confidence of each latent factor pair and adaptively aggregate cross-modal latent factors to achieve local semantic alignment. Experiments show that our THA effectively boosts ATR performance, with the DCR approach further contributing to consistent performance gains.
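The two scoring ideas in the abstract can be sketched at a high level: THA aggregates similarities computed at each Transformer block, and the CA module weights per-latent-factor similarities by an estimated confidence. This is a minimal NumPy sketch, not the paper's implementation; the pooling, the number of latent factors `K`, and the use of a softmax over similarities as the confidence estimator are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def cosine(a, b):
    # Row-wise cosine similarity between matching rows of a and b.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.sum(a * b, axis=-1)

def hierarchical_similarity(audio_blocks, text_blocks, block_weights=None):
    """THA-style sketch: combine per-Transformer-block similarities.

    audio_blocks, text_blocks: lists of pooled (d,) features, one per block.
    block_weights is hypothetical; uniform weighting is used by default.
    """
    sims = np.array([cosine(a, t) for a, t in zip(audio_blocks, text_blocks)])
    if block_weights is None:
        block_weights = np.full(len(sims), 1.0 / len(sims))
    return float(np.dot(block_weights, sims))

def confidence_weighted_match(audio_factors, text_factors):
    """CA-style sketch: confidence-weighted aggregation over latent factors.

    audio_factors, text_factors: (K, d) arrays of K disentangled latent
    factors. Confidence is modeled here as a softmax over per-factor
    similarities; the paper's learned confidence estimator is not
    described in the abstract.
    """
    sims = cosine(audio_factors, text_factors)   # (K,) per-factor similarity
    conf = np.exp(sims) / np.sum(np.exp(sims))   # confidence weights, sum to 1
    return float(np.dot(conf, sims))             # aggregated local match score
```

With perfectly aligned inputs both scores reach 1.0; mismatched latent factors are down-weighted by their lower confidence, which is the intended effect of the CA module.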