🤖 AI Summary
This work addresses the challenge of effectively integrating heterogeneous language models—such as n-gram and neural network models—within hybrid automatic speech recognition (ASR) systems under the federated learning framework, a limitation that hinders rescoring performance and generalization. The study formalizes, for the first time, the optimization of heterogeneous language models in federated settings and introduces a general match-and-merge paradigm. It further proposes two intelligent merging strategies, one based on genetic algorithms (GMMA) and one on reinforcement learning (RMMA), to enable efficient cross-client pairing and joint optimization of non-homogeneous models. Experimental results across seven OpenSLR datasets demonstrate that RMMA achieves the lowest average character error rate, outperforms existing baselines in generalization, and converges up to seven times faster than GMMA.
📝 Abstract
Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) used for rescoring the N-best speech recognition list poses challenges due to the heterogeneity between non-neural n-gram models and neural network models. This paper formalizes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), which uses genetic operations to evolve LM pairings, and the Reinforced Match-and-Merge Algorithm (RMMA), which leverages reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show that RMMA achieves the lowest average character error rate (CER) and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm's potential for scalable, privacy-preserving ASR systems.
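To make the match-and-merge idea concrete, the sketch below frames cross-client LM pairing as a search over permutations, optimized with a simple genetic loop in the spirit of GMMA. This is a minimal illustration, not the paper's algorithm: the `score` matrix (a stand-in for the dev-set rescoring quality of each n-gram/neural LM pair), the population size, and the crossover/mutation operators are all assumptions chosen for clarity.

```python
import random

random.seed(0)

# Hypothetical compatibility scores between N n-gram LM clients (rows)
# and N neural LM clients (columns). In a real system these would come
# from rescoring a held-out set; here they are synthetic stand-ins.
N = 5
score = [[random.random() for _ in range(N)] for _ in range(N)]

def fitness(pairing):
    # pairing[i] = index of the neural LM matched with n-gram LM i;
    # higher total compatibility is better (a proxy for lower CER).
    return sum(score[i][pairing[i]] for i in range(N))

def crossover(a, b):
    # Order crossover: copy a prefix of parent a, then fill with the
    # remaining genes in parent b's order, so the child stays a valid
    # permutation (each neural LM matched exactly once).
    cut = random.randint(1, N - 1)
    head = a[:cut]
    return head + [g for g in b if g not in head]

def mutate(p, rate=0.2):
    # Occasionally swap two matches to keep exploring new pairings.
    p = p[:]
    if random.random() < rate:
        i, j = random.sample(range(N), 2)
        p[i], p[j] = p[j], p[i]
    return p

def gmma_sketch(pop_size=20, generations=30):
    # Evolve a population of candidate cross-client pairings.
    pop = [random.sample(range(N), N) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]          # keep the best pairings
        children = [mutate(crossover(*random.sample(elite, 2)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

best = gmma_sketch()
```

RMMA replaces this evolutionary search with a learned policy that proposes pairings directly, which is where the reported speedup in convergence comes from.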