FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes the first industrial-grade unified speech recognition system that seamlessly integrates four core modules (voice activity detection (VAD), spoken language identification (LID), punctuation prediction, and automatic speech recognition (ASR)) to enable high-accuracy streaming and non-streaming transcription across Mandarin, English, Chinese dialects, accented speech, and code-switching scenarios. The system pairs an ultra-lightweight DFSMN-based VAD with large-scale ASR models (FireRedASR2-LLM/AED), substantially expanding dialect coverage. It achieves state-of-the-art performance across multiple benchmarks: a 2.89% average character error rate (CER) on Mandarin and 11.55% on dialect benchmarks, a 97.57% frame-level VAD F1 on FLEURS-VAD-102, 97.18% LID accuracy across 82 languages, and a 78.90% punctuation prediction F1, consistently outperforming existing open-source systems.

📝 Abstract
We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All modules achieve SOTA performance on the evaluated benchmarks:

FireRedASR2: An ASR module with two variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters), supporting speech and singing transcription for Mandarin, Chinese dialects and accents, English, and code-switching. Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage. FireRedASR2-LLM achieves a 2.89% average CER on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialect and accent benchmarks, outperforming competitive baselines including Doubao-ASR, Qwen3-ASR, and Fun-ASR.

FireRedVAD: An ultra-lightweight module (0.6M parameters) based on the Deep Feedforward Sequential Memory Network (DFSMN), supporting streaming VAD, non-streaming VAD, and multi-label VAD (mVAD). On the FLEURS-VAD-102 benchmark, it achieves 97.57% frame-level F1 and 99.60% AUC-ROC, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.

FireRedLID: An encoder-decoder LID module supporting 100+ languages and 20+ Chinese dialects and accents. On FLEURS (82 languages), it achieves 97.18% utterance-level accuracy, outperforming Whisper and SpeechBrain.

FireRedPunc: A BERT-style punctuation prediction module for Chinese and English. On multi-domain benchmarks, it achieves 78.90% average F1, outperforming FunASR-Punc (62.77%).

To advance research in speech processing, we release model weights and code at https://github.com/FireRedTeam/FireRedASR2S.
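The unified pipeline described above can be sketched as VAD segmentation followed by per-segment LID, ASR, and punctuation restoration. The sketch below is a minimal, hypothetical illustration of that control flow only: the `Segment` class, the energy-threshold `run_vad`, and the `identify_language`/`recognize`/`punctuate` stubs are toy stand-ins invented here, not the FireRedVAD/FireRedLID/FireRedASR2/FireRedPunc APIs (see the repository for the real interfaces).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    start: float        # segment start time in seconds
    end: float          # segment end time in seconds
    language: str = ""  # filled by the LID stage
    text: str = ""      # filled by the ASR + Punc stages

def run_vad(audio: List[float], frame_s: float = 0.02,
            threshold: float = 0.5) -> List[Segment]:
    """Toy energy-threshold stand-in for FireRedVAD: contiguous runs of
    frames whose magnitude exceeds `threshold` become speech segments."""
    segments, start = [], None
    for i, sample in enumerate(audio):
        voiced = abs(sample) > threshold
        if voiced and start is None:
            start = i * frame_s
        elif not voiced and start is not None:
            segments.append(Segment(start, i * frame_s))
            start = None
    if start is not None:  # audio ends while still voiced
        segments.append(Segment(start, len(audio) * frame_s))
    return segments

def identify_language(segment: Segment) -> str:
    """Stand-in for FireRedLID; always answers Mandarin here."""
    return "zh"

def recognize(segment: Segment) -> str:
    """Stand-in for FireRedASR2; returns fixed unpunctuated tokens."""
    return "你好 世界"

def punctuate(text: str) -> str:
    """Stand-in for FireRedPunc; joins tokens and appends a full stop."""
    return text.replace(" ", "") + "。"

def pipeline(audio: List[float]) -> List[Segment]:
    """VAD → LID → ASR → Punc, applied per detected speech segment."""
    results = []
    for seg in run_vad(audio):
        seg.language = identify_language(seg)
        seg.text = punctuate(recognize(seg))
        results.append(seg)
    return results

# Demo: 10 silent frames, 10 voiced frames, 10 silent frames.
audio = [0.0] * 10 + [1.0] * 10 + [0.0] * 10
for seg in pipeline(audio):
    print(f"[{seg.start:.2f}-{seg.end:.2f}] ({seg.language}) {seg.text}")
    # → [0.20-0.40] (zh) 你好世界。
```

The point of the sketch is the ordering: segmentation runs first so the heavy ASR model only sees speech, and punctuation is restored last on the raw token stream, which matches the module decomposition the abstract describes.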
Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition
Voice Activity Detection
Spoken Language Identification
Punctuation Prediction
Multilingual ASR
Innovation

Methods, ideas, or system contributions that make the work stand out.

All-in-One ASR System
State-of-the-Art Performance
Multi-Dialect and Code-Switching Support
Ultra-Lightweight VAD
Unified Speech Processing Pipeline
Kaituo Xu
Super Intelligence Team, Xiaohongshu Inc.
Yan Jia
Super Intelligence Team, Xiaohongshu Inc.
Kai Huang
Super Intelligence Team, Xiaohongshu Inc.
Junjie Chen
Super Intelligence Team, Xiaohongshu Inc.
Wenpeng Li
Super Intelligence Team, Xiaohongshu Inc.
Kun Liu
Super Intelligence Team, Xiaohongshu Inc.
Feng-Long Xie
Super Intelligence Team, Xiaohongshu Inc.
Xu Tang
Xiaohongshu. Personal homepage: https://tangxuvis.github.io/
Face Detection · Face Recognition · GAN · Video Understanding · Text Video Retrieval
Yao Hu
Zhejiang University
Machine Learning