New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR

📅 2025-09-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In ASR, structural asymmetries—such as many-to-one and one-to-many acoustic–linguistic alignments—along with redundant or noisy frames hinder conventional alignment methods from simultaneously ensuring full linguistic unit coverage and acoustic sequence robustness. To address this, we reformulate acoustic–linguistic alignment as a detection task and propose a soft alignment framework grounded in unbalanced optimal transport (UOT). Our UOT-based approach inherently supports partial matching, tolerates distributional shifts and structural asymmetry, and provably guarantees that each linguistic token is associated with at least one acoustic observation. The framework integrates seamlessly into CTC-based ASR systems and enables effective knowledge transfer from pretrained language models. Experiments across diverse noise conditions and domain mismatches demonstrate substantial improvements in both recognition robustness and accuracy.

📝 Abstract

Aligning acoustic and linguistic representations is a central challenge in bridging pre-trained models for knowledge transfer in automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence, which may lead to imbalanced matching conditions. In this work, we offer a new perspective that treats alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall, ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames when transferring linguistic knowledge for ASR. Based on this insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between the acoustic and linguistic modalities. Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing for flexible, probabilistic mappings from acoustic to linguistic units. We evaluate the proposed model on a CTC-based ASR system that uses a pre-trained language model for knowledge transfer. Experimental results demonstrate that the approach can flexibly control the degree of matching and thereby improve ASR performance.
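To make the idea of soft, partial matching concrete, here is a minimal sketch of an entropic unbalanced optimal transport alignment in NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function name `uot_align`, the cosine cost, the uniform marginals, and the values of `eps` and `rho` are all hypothetical choices. The sketch relaxes only the acoustic marginal (so noisy or silent frames may carry less mass) and keeps the linguistic marginal strict (so every token receives its full share of mass), which is one simple way to mirror the coverage property described in the abstract.

```python
import numpy as np


def uot_align(acoustic, linguistic, eps=0.1, rho=1.0, n_iters=200):
    """Semi-relaxed entropic UOT between frames (T, d) and tokens (N, d).

    Assumptions (not from the paper): cosine cost, uniform marginals,
    KL relaxation with strength `rho` on the acoustic side only.
    Returns a (T, N) soft alignment plan.
    """
    # Cosine distance between every frame and every token, in [0, 2].
    a_n = acoustic / np.linalg.norm(acoustic, axis=1, keepdims=True)
    l_n = linguistic / np.linalg.norm(linguistic, axis=1, keepdims=True)
    cost = 1.0 - a_n @ l_n.T

    kernel = np.exp(-cost / eps)          # Gibbs kernel
    n_frames, n_tokens = cost.shape
    a = np.full(n_frames, 1.0 / n_frames)  # acoustic marginal (relaxed)
    b = np.full(n_tokens, 1.0 / n_tokens)  # linguistic marginal (strict)

    u = np.ones(n_frames)
    v = np.ones(n_tokens)
    damp = rho / (rho + eps)               # exponent < 1 softens the frame constraint
    for _ in range(n_iters):
        u = (a / (kernel @ v)) ** damp     # relaxed update: frames may lose mass
        v = b / (kernel.T @ u)             # exact update: each token keeps mass b_j
    return u[:, None] * kernel * v[None, :]


# Toy usage: 12 speech-like frames (noisy copies of 3 tokens) plus 5 noise frames.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 8))
speech = np.repeat(tokens, 4, axis=0) + 0.05 * rng.normal(size=(12, 8))
frames = np.vstack([speech, rng.normal(size=(5, 8))])
plan = uot_align(frames, tokens)
print(plan.sum(axis=0))  # mass received by each token
print(plan.sum(axis=1))  # mass kept by each frame
```

Because the token-side update is exact and runs last, each column of the plan sums to the token's marginal, while row sums are free to deviate from uniform. Lowering `rho` loosens the frame constraint further; raising it approaches balanced transport.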

Problem

Research questions and friction points this paper is trying to address.

Aligning acoustic and linguistic representations for ASR knowledge transfer

Handling structural asymmetries in acoustic-linguistic mapping relationships

Ensuring full linguistic token coverage while managing redundant acoustic frames

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unbalanced optimal transport for acoustic-linguistic alignment

Soft partial matching to handle distributional mismatches

Detection-based approach ensuring full linguistic token coverage
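
For reference, the generic entropy-regularized unbalanced optimal transport objective can be written as below. This is the textbook form, given here as an assumption about the general setting rather than the exact loss used in the paper; the symbols $C$ (frame-to-token cost matrix), $a$ and $b$ (acoustic and linguistic marginals), $\gamma$ (transport plan), and the weights $\varepsilon$, $\rho_a$, $\rho_b$ are notation introduced for illustration.

```latex
\min_{\gamma \in \mathbb{R}_{\ge 0}^{T \times N}}
  \langle C, \gamma \rangle
  - \varepsilon \, H(\gamma)
  + \rho_a \, \mathrm{KL}\!\left(\gamma \mathbf{1}_N \,\middle\|\, a\right)
  + \rho_b \, \mathrm{KL}\!\left(\gamma^{\top} \mathbf{1}_T \,\middle\|\, b\right),
\qquad
H(\gamma) = -\sum_{i=1}^{T} \sum_{j=1}^{N} \gamma_{ij} \left( \log \gamma_{ij} - 1 \right)
```

In this form, a large $\rho_b$ pushes the plan to give every linguistic token its share of mass, while a small $\rho_a$ lets redundant or noisy acoustic frames contribute little. Balanced optimal transport is recovered as both weights tend to infinity.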
