Do Audio-Language Models Understand Linguistic Variations?

📅 2024-10-21
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary audio-language models (e.g., CLAP) exhibit significant performance degradation under linguistic variations such as paraphrasing and syntactic rephrasing. This work is the first to systematically characterize the vulnerability of audio-language models (ALMs) to such semantics-preserving textual perturbations. Method: We propose RobustCLAP, a robustness-enhancement framework based on multi-view contrastive learning. It treats diverse textual descriptions (e.g., paraphrases) of the same audio as semantically consistent views, introduces a novel multi-view contrastive loss, and integrates linguistic-variation-aware data augmentation during training. Contribution/Results: Evaluated across multiple text-to-audio retrieval benchmarks, RobustCLAP achieves average improvements of 0.8–13.0 percentage points in Recall@1. It substantially enhances model robustness to semantically equivalent yet lexically diverse queries, establishing a scalable contrastive-learning paradigm for robust open-vocabulary cross-modal understanding.

📝 Abstract
Open-vocabulary audio language models (ALMs), like Contrastive Language Audio Pretraining (CLAP), represent a promising new paradigm for audio-text retrieval using natural language queries. In this paper, for the first time, we perform controlled experiments on various benchmarks to show that existing ALMs struggle to generalize to linguistic variations in textual queries. To address this issue, we propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations agnostic to linguistic variations. Specifically, we reformulate the contrastive loss used in CLAP architectures by introducing a multi-view contrastive learning objective, where paraphrases are treated as different views of the same audio scene and use this for training. Our proposed approach improves the text-to-audio retrieval performance of CLAP by 0.8%-13% across benchmarks and enhances robustness to linguistic variation.
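The abstract's core idea — reformulating CLAP's contrastive loss so that paraphrases are treated as different views of the same audio scene — can be sketched as a multi-view InfoNCE objective. The sketch below is illustrative only: the pure-NumPy setup, function names, and temperature value are assumptions, not the paper's actual implementation.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along the given axis.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def multiview_contrastive_loss(audio_emb, text_views, temperature=0.07):
    """Multi-view contrastive (InfoNCE-style) loss sketch.

    audio_emb:  (N, d) batch of audio embeddings.
    text_views: (V, N, d) text embeddings; view v, row i is the v-th
                paraphrase of audio i's caption. Every view of audio i
                is a positive for audio i; other rows act as negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    n = a.shape[0]
    losses = []
    for t in text_views:
        t = t / np.linalg.norm(t, axis=1, keepdims=True)
        logits = a @ t.T / temperature  # (N, N) similarity matrix
        # Symmetric cross-entropy: matching index i is the target
        # in both the audio-to-text and text-to-audio directions.
        log_p_at = logits - _logsumexp(logits, axis=1)
        log_p_ta = logits - _logsumexp(logits, axis=0)
        losses.append(-(np.trace(log_p_at) + np.trace(log_p_ta)) / (2 * n))
    # Average the standard pairwise CLAP loss over all paraphrase views.
    return float(np.mean(losses))
```

With a single view this reduces to the usual CLAP contrastive objective; averaging over paraphrase views is what pushes the text encoder toward representations that are invariant to lexically diverse but semantically equivalent queries.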
Problem

Research questions and friction points this paper is trying to address.

Generalization to linguistic variations
Audio-text retrieval efficiency
Multi-view contrastive learning objective
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Language Audio Pretraining
Multi-view contrastive learning
RobustCLAP enhances text-to-audio retrieval