FunAudio-ASR Technical Report

📅 2025-09-15
🤖 AI Summary
To address critical limitations of large language models (LLMs) in automatic speech recognition (ASR)—including severe hallucination, poor cross-scenario generalization, and substantial performance degradation on industrial benchmarks versus open-source ones—this paper proposes a production-oriented end-to-end ASR system. Methodologically, it innovatively integrates large-scale multilingual speech pretraining, deep LLM-augmented joint acoustic-language modeling, and a streaming end-to-end architecture, further enhanced by reinforcement learning for sequence-level robustness optimization. This yields significant improvements in noise robustness, code-switching accuracy, and hotword customization. Evaluated on a realistic industrial test set, the system achieves state-of-the-art performance, substantially outperforming leading open-source models. Results demonstrate its effectiveness and practicality in challenging real-world scenarios—including multilingual mixing, high-noise environments, and dynamically evolving domains.

📝 Abstract
In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
Problem

Research questions and friction points this paper is trying to address.

Addresses LLM hallucination in ASR systems
Optimizes ASR for real-world deployment challenges
Enhances performance on industry evaluation datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based ASR with reinforcement learning
Production-optimized for streaming and robustness
Enhanced noise robustness and hotword customization
Keyu An
Tongyi Lab, Alibaba Group
Yanni Chen
Tongyi Lab, Alibaba Group
Chong Deng
Alibaba Group
machine learning, natural language processing
Changfeng Gao
Tongyi Lab, Alibaba Group
Zhifu Gao
Tongyi Lab, Alibaba Group
Bo Gong
Tongyi Lab, Alibaba Group
Xiangang Li
Unknown affiliation
speech recognition, natural language processing
Yabin Li
Tongyi Lab, Alibaba Group
Xiang Lv
Tongyi Lab, Alibaba Group
Yunjie Ji
Unknown affiliation
Yiheng Jiang
University of Science and Technology of China
Compression
Bin Ma
Tongyi Lab, Alibaba Group
Haoneng Luo
Tongyi Lab, Alibaba Group
Chongjia Ni
Tongyi Lab, Alibaba Group
Zexu Pan
Tongyi Lab, Alibaba Group
Yiping Peng
Tongyi Lab, Alibaba Group
Zhendong Peng
Tsinghua University
ASR
Peiyao Wang
Stony Brook University
computer vision
Hao Wang
Tongyi Lab, Alibaba Group
Wen Wang
Tongyi Lab, Alibaba Group
Wupeng Wang
Tongyi Lab, Alibaba Group
Biao Tian
Alibaba DAMO Academy
Signal Processing, Acoustics, Robotics, Machine Learning
Zhentao Tan
Tongyi Lab, Alibaba Group
Nan Yang
Tongyi Lab, Alibaba Group
Bin Yuan
Tongyi Lab, Alibaba Group