Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
In end-to-end speech recognition, cross-modal knowledge transfer from language models to acoustic modeling suffers from misalignment between linguistic and acoustic representations due to heterogeneous modal structures. Method: This paper proposes the Graph Matching Optimal Transport (GM-OT) framework, which models linguistic and acoustic sequences as temporally structured graphs and jointly optimizes node-level Wasserstein distance and edge-level Gromov–Wasserstein distance. It introduces the Fused Gromov–Wasserstein Distance (FGWD), theoretically unifying existing optimal transport formulations. Contribution/Results: GM-OT is the first to integrate graph-structured representation learning with optimal transport for cross-modal knowledge transfer. Integrated into a CTC-based E2E-ASR architecture with PLM-guided knowledge distillation, it achieves significant improvements over state-of-the-art methods on Mandarin ASR benchmarks. Empirical results demonstrate that structured graph alignment enhances both the effectiveness and robustness of cross-modal knowledge transfer.

📝 Abstract
Transferring linguistic knowledge from a pretrained language model (PLM) to acoustic feature learning has proven effective in enhancing end-to-end automatic speech recognition (E2E-ASR). However, aligning representations between linguistic and acoustic modalities remains a challenge due to inherent modality gaps. Optimal transport (OT) has shown promise in mitigating these gaps by minimizing the Wasserstein distance (WD) between linguistic and acoustic feature distributions. However, previous OT-based methods overlook structural relationships, treating feature vectors as unordered sets. To address this, we propose Graph Matching Optimal Transport (GM-OT), which models linguistic and acoustic sequences as structured graphs. Nodes represent feature embeddings, while edges capture temporal and sequential relationships. GM-OT minimizes both WD (between nodes) and Gromov-Wasserstein distance (GWD) (between edges), leading to a fused Gromov-Wasserstein distance (FGWD) formulation. This enables structured alignment and more efficient knowledge transfer compared to existing OT-based approaches. Theoretical analysis further shows that prior OT-based methods in linguistic knowledge transfer can be viewed as a special case within our GM-OT framework. We evaluate GM-OT on Mandarin ASR using a CTC-based E2E-ASR system with a PLM for knowledge transfer. Experimental results demonstrate significant performance gains over state-of-the-art models, validating the effectiveness of our approach.
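The fused objective described in the abstract can be written out explicitly. The following is a sketch in the standard fused Gromov-Wasserstein form; the trade-off weight $\alpha$ and the squared structure loss are illustrative assumptions, since the abstract does not fix the paper's exact weighting:

```latex
\mathrm{FGWD}_{\alpha}(\mu,\nu)
  = \min_{\gamma \in \Pi(p,q)}
    \sum_{i,j,k,l}
    \Big[ (1-\alpha)\, c(x_i, y_j)
        + \alpha \,\big| C^{L}_{ik} - C^{A}_{jl} \big|^{2} \Big]
    \, \gamma_{ij}\, \gamma_{kl}
```

Here $x_i$ and $y_j$ are linguistic and acoustic node embeddings, $c(\cdot,\cdot)$ is the node-level ground cost (the WD term), $C^{L}$ and $C^{A}$ are intra-graph edge structure matrices (the GWD term compares them), and $\Pi(p,q)$ is the set of couplings with marginals $p$ and $q$. Setting $\alpha = 0$ recovers a pure node-level Wasserstein objective, which is consistent with the abstract's claim that prior OT-based transfer methods are special cases of this framework.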
Problem

Research questions and friction points this paper is trying to address.

Aligning linguistic and acoustic modalities in ASR
Addressing prior OT methods' neglect of structural relationships
Enhancing knowledge transfer via graph matching OT
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph Matching Optimal Transport for structured alignment
Fused Gromov-Wasserstein distance for multimodal knowledge transfer
Modeling sequences as graphs with nodes and edges
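As a concrete illustration of the node-plus-edge alignment idea above, the sketch below implements a minimal fused Gromov-Wasserstein alignment in NumPy: node costs come from feature distances, edge structure comes from temporal position gaps, and the coupling is refined by alternating a linearized fused cost with entropic (Sinkhorn) OT. This is not the paper's implementation; the choice of `alpha`, the position-based structure matrices, and the entropic solver are all assumptions made for the sketch.

```python
import numpy as np

def sinkhorn(cost, p, q, reg=0.05, n_iter=200):
    """Entropic OT: returns a coupling with row marginals p (exact)
    and column marginals approximately q."""
    K = np.exp(-cost / reg)
    u = np.ones_like(p)
    for _ in range(n_iter):
        v = q / (K.T @ u)
        u = p / (K @ v)
    return u[:, None] * K * v[None, :]

def fgw_align(X, Y, alpha=0.5, reg=0.05, n_outer=10):
    """Minimal fused Gromov-Wasserstein sketch: alternate between
    linearizing the fused (node + edge) cost and solving entropic OT."""
    n, m = len(X), len(Y)
    p = np.full(n, 1.0 / n)   # uniform node masses (assumption)
    q = np.full(m, 1.0 / m)
    # Node-level cost (Wasserstein term): pairwise squared distances.
    M = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    # Edge-level structure (Gromov-Wasserstein term): temporal position
    # gaps stand in for the paper's graph edges (illustrative choice).
    C1 = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
    C2 = np.abs(np.subtract.outer(np.arange(m), np.arange(m))).astype(float)
    gamma = np.outer(p, q)
    for _ in range(n_outer):
        # Linearized GW cost via the tensor-product trick:
        # sum_{k,l} (C1_ik - C2_jl)^2 gamma_kl, expanded with marginals.
        gw = (C1**2 @ p)[:, None] + (C2**2 @ q)[None, :] - 2 * C1 @ gamma @ C2.T
        fused = (1 - alpha) * M + alpha * gw
        fused = fused / fused.max()   # normalize to keep Sinkhorn stable
        gamma = sinkhorn(fused, p, q, reg)
    return gamma
```

Setting `alpha=0` reduces the fused cost to the node-level term alone, mirroring the claim that purely node-level OT transfer is a special case of the graph-matching formulation.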