🤖 AI Summary
To address critical limitations of large language models (LLMs) in automatic speech recognition (ASR), including severe hallucination, poor cross-scenario generalization, and substantial performance degradation on industrial benchmarks relative to open-source ones, this paper proposes a production-oriented end-to-end ASR system. The method combines large-scale multilingual speech pretraining, LLM-augmented joint acoustic-language modeling, and a streaming end-to-end architecture, and further applies reinforcement learning to optimize sequence-level robustness. This yields significant improvements in noise robustness, code-switching accuracy, and hotword customization. Evaluated on a realistic industrial test set, the system achieves state-of-the-art performance, substantially outperforming leading open-source models. The results demonstrate its effectiveness and practicality in challenging real-world scenarios, including multilingual mixing, high-noise environments, and dynamically evolving domains.
📝 Abstract
In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art (SOTA) performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, and hotword customization, as well as support for other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.