Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the cross-modal embedding misalignment caused by modality heterogeneity between audio and text in open-vocabulary keyword spotting, this paper proposes a fine-grained alignment framework based on deep metric learning. The method introduces three key innovations: (1) a novel modality adversarial learning (MAL) mechanism that trains a modality classifier adversarially, pushing the audio and text encoders to produce modality-invariant embeddings; (2) the first phoneme-level cross-modal alignment modeling, which establishes fine-grained correspondence between audio and text; and (3) a systematic comparison of multiple metric learning objectives, integrated into a multi-task joint optimization framework. Evaluated on WSJ and LibriPhrase, the approach reduces the inter-modal embedding distribution distance by 37% and improves top-1 cross-modal retrieval accuracy by 9.2%, yielding substantial gains in open-vocabulary keyword recognition performance.

📝 Abstract

For text enrollment-based open-vocabulary keyword spotting (KWS), acoustic and text embeddings are typically compared at either the phoneme or utterance level. To facilitate this, we optimize acoustic and text encoders using deep metric learning (DML), enabling direct comparison of multi-modal embeddings in a shared embedding space. However, the inherent heterogeneity between audio and text modalities presents a significant challenge. To address this, we propose Modality Adversarial Learning (MAL), which reduces the domain gap in heterogeneous modality representations. Specifically, we train a modality classifier adversarially to encourage both encoders to generate modality-invariant embeddings. Additionally, we apply DML to achieve phoneme-level alignment between audio and text, and conduct comprehensive comparisons across various DML objectives. Experiments on the Wall Street Journal (WSJ) and LibriPhrase datasets demonstrate the effectiveness of the proposed approach.
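
The adversarial idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a single-layer logistic modality classifier (label 1 = audio, 0 = text), two-dimensional embeddings, and a gradient-reversal update with a hypothetical strength `lam`, all chosen purely for illustration. The classifier is scored with binary cross-entropy, while the encoders receive the sign-flipped gradient so that their embeddings become harder to tell apart.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def modality_logit(embedding, weights, bias):
    # Linear modality classifier (an illustrative assumption, not the paper's architecture).
    return sum(w * x for w, x in zip(weights, embedding)) + bias

def classifier_bce(embeddings, labels, weights, bias):
    # Mean binary cross-entropy of the modality classifier (1 = audio, 0 = text).
    total = 0.0
    for emb, y in zip(embeddings, labels):
        p = sigmoid(modality_logit(emb, weights, bias))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(embeddings)

def reversed_encoder_grad(embedding, label, weights, bias, lam=1.0):
    # d(BCE)/d(embedding) = (p - y) * w for a logistic classifier.
    # Gradient reversal multiplies it by -lam before it reaches the encoder.
    p = sigmoid(modality_logit(embedding, weights, bias))
    return [-lam * (p - label) * w for w in weights]

# Toy setup: audio and text embeddings sit on opposite sides of the classifier boundary.
weights, bias = [1.0, 0.0], 0.0
audio, text = [2.0, 0.0], [-2.0, 0.0]
loss_before = classifier_bce([audio, text], [1, 0], weights, bias)

# One descent step on the reversed gradient, standing in for an encoder update.
lr = 1.0
audio_new = [x - lr * g for x, g in zip(audio, reversed_encoder_grad(audio, 1, weights, bias))]
text_new = [x - lr * g for x, g in zip(text, reversed_encoder_grad(text, 0, weights, bias))]
loss_after = classifier_bce([audio_new, text_new], [1, 0], weights, bias)
```

In this toy setup the update moves the two embeddings toward each other, so the classifier's loss rises after the step. In a full training loop the classifier would also be updated to lower its own loss, which is what makes the game adversarial.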

Problem

Research questions and friction points this paper is trying to address.

Addressing the modality gap in audio-text alignment for keyword spotting

Enhancing cross-modal embedding comparison via adversarial learning

Optimizing phoneme-level alignment using deep metric learning objectives

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality Adversarial Learning reduces the domain gap

Deep metric learning enables a shared embedding space

Phoneme-level alignment between audio and text
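
To make the phoneme-level alignment item concrete, here is a minimal sketch of one possible deep metric learning objective. It uses a triplet-style loss with cosine similarity as a stand-in; the page does not say which objectives the paper compares, so the loss form, the `margin` value, and the toy phoneme labels are all assumptions. Each audio segment embedding is pulled toward the text embedding of its own phoneme and pushed away from the most similar text embedding of any other phoneme.

```python
import math

def cosine(u, v):
    # Cosine similarity; assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def phoneme_triplet_loss(audio_embs, audio_phonemes, text_embs, margin=0.2):
    # audio_embs: one vector per audio segment; audio_phonemes: its phoneme label.
    # text_embs: dict mapping each phoneme label to its text embedding.
    losses = []
    for emb, phoneme in zip(audio_embs, audio_phonemes):
        positive = cosine(emb, text_embs[phoneme])
        # Hardest negative: the most similar text embedding of a different phoneme.
        hardest_negative = max(
            cosine(emb, vec) for label, vec in text_embs.items() if label != phoneme
        )
        losses.append(max(0.0, margin - positive + hardest_negative))
    return sum(losses) / len(losses)

# Hypothetical toy embeddings for three phonemes.
text_embs = {"k": [1.0, 0.0], "ae": [0.0, 1.0], "t": [-1.0, 0.0]}
audio_embs = [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.1]]

aligned_loss = phoneme_triplet_loss(audio_embs, ["k", "ae", "t"], text_embs)
misaligned_loss = phoneme_triplet_loss(audio_embs, ["ae", "t", "k"], text_embs)
```

When each audio embedding already lies close to its own phoneme's text embedding, every margin is satisfied and the loss vanishes; shuffling the labels violates the margins and gives a positive loss, which is the signal that would drive the encoders during training.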

🔎 Similar Papers

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation