ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors

📅 2025-02-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address inconsistent cross-lingual similarity matching in multilingual audio–text retrieval (ML-ATR), this paper traces the inconsistency to a data distribution error caused by random sampling of languages. The authors first derive a theoretical upper bound on the weight error to quantify the inconsistency. Building on this analysis, they introduce two contrastive learning mechanisms: (1) 1-to-k contrastive learning, which strengthens semantic alignment across languages; and (2) audio–English co-anchor contrastive learning, which uses English as a linguistic pivot to mitigate the distribution error. Evaluated on the translated AudioCaps and Clotho datasets across eight mainstream languages, including English, the method achieves state-of-the-art results on both recall and cross-lingual consistency metrics.

📝 Abstract

Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to retrieve audio clips or multilingual texts from databases. However, existing ML-ATR schemes suffer from inconsistencies in instance similarity matching across languages. We theoretically analyze the inconsistency in terms of both the multilingual modal alignment direction error and the weight error, and propose a theoretical upper bound on the weight error to quantify the inconsistency. Based on the analysis of this upper bound, we find that the inconsistency problem stems from the data distribution error caused by random sampling of languages. We propose a consistent ML-ATR scheme using 1-to-k contrastive learning and audio-English co-anchor contrastive learning, aiming to mitigate the negative impact of data distribution error on recall and consistency in ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets show that our scheme achieves state-of-the-art performance on recall and consistency metrics for eight mainstream languages, including English. Our code will be available at https://github.com/ATRI-ACL/ATRI-ACL.

Problem

Research questions and friction points this paper is trying to address.

Addressing inconsistencies in multilingual audio-text retrieval

Reducing data distribution errors in ML-ATR schemes

Improving recall and consistency in cross-language retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

1-to-k contrastive learning

Audio-English co-anchor contrastive learning

Data distribution error mitigation
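
As a rough illustration of the 1-to-k idea, the sketch below pairs every audio clip with its caption in all k languages in the same step, instead of with one randomly sampled language. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name one_to_k_loss, the audio-to-text-only InfoNCE form, the plain averaging over languages, and the temperature of 0.07 are all hypothetical choices made for this example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def one_to_k_loss(audio, texts, temperature=0.07):
    """Hypothetical audio-to-text contrastive loss over k languages.

    audio: list of N audio embeddings.
    texts: list of k languages, each a list of N text embeddings,
           where texts[l][i] is the caption of audio[i] in language l.
    Assumption: each language contributes one InfoNCE term per clip,
    and the terms are averaged; the paper's exact loss may differ.
    """
    audio = [normalize(a) for a in audio]
    texts = [[normalize(t) for t in lang] for lang in texts]
    n, k = len(audio), len(texts)
    total = 0.0
    for lang in texts:
        for i in range(n):
            # Similarity of clip i to every caption in this language.
            logits = [dot(audio[i], lang[j]) / temperature for j in range(n)]
            # Numerically stable log-sum-exp for the softmax denominator.
            m = max(logits)
            log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
            # Negative log-probability of the matching caption.
            total += log_denominator - logits[i]
    return total / (n * k)


# Toy data: 2 clips, 2 languages, 2-dimensional embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]
swapped = [[[0.0, 1.0], [1.0, 0.0]], [[0.1, 0.9], [0.9, 0.1]]]
loss_aligned = one_to_k_loss(audio, aligned)
loss_swapped = one_to_k_loss(audio, swapped)
```

In this toy setup the aligned captions yield a much smaller loss than the swapped ones, because every language's caption is pulled toward the same audio embedding at once. An audio-English co-anchor term would be an additional loss on top of this; its form is not given in this summary, so it is left out of the sketch.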

🔎 Similar Papers

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

2024-09-01, Interspeech, Citations: 2