Knowledge-Decoupled Functionally Invariant Path with Synthetic Personal Data for Personalized ASR

2025-10-11
AI Summary
To address the dual challenges of catastrophic forgetting of generic knowledge and interference between knowledge types in personalized automatic speech recognition (ASR) fine-tuning, this paper proposes a knowledge-decoupled framework based on the functionally invariant path (FIP). The method integrates gated parameter isolation into FIP: generic and personalized knowledge are stored in separate modules, FIP is applied to them sequentially, and their outputs are fused through a gating mechanism. The personalized module is adapted to synthetic and real personal data, and the generic module to generic data. With augmented synthetic data, experiments show a 29.38% relative reduction in character error rate (CER) on target speakers while keeping generalization performance comparable to the unadapted ASR baseline.
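The 29.38% figure is a relative reduction, not an absolute drop in CER. The sketch below shows how such a figure is typically computed; the function name and the example CER values are hypothetical and are not taken from the paper.

```python
def relative_cer_reduction(baseline_cer: float, adapted_cer: float) -> float:
    """Relative CER reduction in percent: (baseline - adapted) / baseline * 100."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - adapted_cer) / baseline_cer * 100.0


# Hypothetical values for illustration only (not reported in the paper):
# a drop from 10.0% CER to 8.0% CER is a 2-point absolute reduction,
# which corresponds to a 20% relative reduction.
example = relative_cer_reduction(baseline_cer=10.0, adapted_cer=8.0)
```

Because the result depends on the baseline, the same relative reduction can correspond to very different absolute gains.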

๐Ÿ“ Abstract
Fine-tuning generic ASR models with large-scale synthetic personal data can enhance the personalization of ASR models, but it introduces challenges in adapting to synthetic personal data without forgetting real knowledge, and in adapting to personal data without forgetting generic knowledge. Considering that the functionally invariant path (FIP) framework enables model adaptation while preserving prior knowledge, in this letter, we introduce FIP into synthetic-data-augmented personalized ASR models. However, the model still struggles to balance the learning of synthetic, personalized, and generic knowledge when applying FIP to train the model on all three types of data simultaneously. To decouple this learning process and further address the above two challenges, we integrate a gated parameter-isolation strategy into FIP and propose a knowledge-decoupled functionally invariant path (KDFIP) framework, which stores generic and personalized knowledge in separate modules and applies FIP to them sequentially. Specifically, KDFIP adapts the personalized module to synthetic and real personal data and the generic module to generic data. Both modules are updated along personalization-invariant paths, and their outputs are dynamically fused through a gating mechanism. With augmented synthetic data, KDFIP achieves a 29.38% relative character error rate reduction on target speakers and maintains comparable generalization performance to the unadapted ASR baseline.
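The abstract states only that the outputs of the two modules are "dynamically fused through a gating mechanism". As a minimal sketch of what such a fusion can look like, the example below assumes an element-wise sigmoid gate forming a convex combination of the two outputs. The gate form, the function names, and the input values are all assumptions made for illustration, not the paper's formulation.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(generic_out, personal_out, gate_logits):
    """Element-wise fusion: g * personal + (1 - g) * generic, with g = sigmoid(logit).

    Assumed gate form; the source does not specify how the gate is computed.
    """
    if not (len(generic_out) == len(personal_out) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for h_generic, h_personal, logit in zip(generic_out, personal_out, gate_logits):
        g = sigmoid(logit)
        fused.append(g * h_personal + (1.0 - g) * h_generic)
    return fused


# With zero logits the gate is 0.5, so the result is the plain average.
fused = gated_fusion([1.0, 2.0], [3.0, 0.0], [0.0, 0.0])
```

In this sketch a large positive logit lets the personalized output dominate and a large negative logit falls back to the generic output. In a trained model the logits would be produced from the input rather than fixed by hand.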
Problem

Research questions and friction points this paper is trying to address.

Balancing synthetic and real knowledge in personalized ASR fine-tuning
Preventing generic knowledge forgetting during personal data adaptation
Decoupling learning of synthetic, personalized, and generic ASR knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-decoupled FIP separates generic and personalized modules
Gated parameter-isolation strategy enables sequential knowledge adaptation
Gating mechanism dynamically fuses the outputs of the generic and personalized modules
Yue Gu
Research Center of Auditory Intelligence, School of Computer Science and Technology, Faculty of Computing, Harbin Institute of Technology, Harbin, China
Zhihao Du
Alibaba
Speech separation, speech enhancement, speaker diarization
Ying Shi
Syracuse University
Education Policy, Racial Inequality, Labor Economics
Jiqing Han
Research Center of Auditory Intelligence, School of Computer Science and Technology, Faculty of Computing, Harbin Institute of Technology, Harbin, China
Yongjun He
Research Center of Auditory Intelligence, School of Computer Science and Technology, Faculty of Computing, Harbin Institute of Technology, Harbin, China