FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

📅 2026-03-11

🤖 AI Summary

This work presents an industrial-grade, all-in-one speech recognition system that integrates four modules in a single pipeline: automatic speech recognition (ASR), voice activity detection (VAD), spoken language identification (LID), and punctuation prediction. The ASR module, available as FireRedASR2-LLM and FireRedASR2-AED, transcribes speech and singing in Mandarin, Chinese dialects and accents, English, and code-switched audio, with broader dialect and accent coverage than FireRedASR. It is paired with an ultra-lightweight (0.6M-parameter) DFSMN-based VAD that supports both streaming and non-streaming operation. Reported results: average CER of 2.89% on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialect and accent benchmarks (FireRedASR2-LLM); 97.57% frame-level F1 on FLEURS-VAD-102; 97.18% utterance-level LID accuracy on FLEURS (82 languages); and 78.90% average punctuation F1 on multi-domain benchmarks, each ahead of the baselines named in the abstract.

📝 Abstract

We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All modules achieve SOTA performance on the evaluated benchmarks:

FireRedASR2: An ASR module with two variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters), supporting speech and singing transcription for Mandarin, Chinese dialects and accents, English, and code-switching. Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage. FireRedASR2-LLM achieves 2.89% average CER on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialects and accents benchmarks, outperforming competitive baselines including Doubao-ASR, Qwen3-ASR, and Fun-ASR.

FireRedVAD: An ultra-lightweight module (0.6M parameters) based on the Deep Feedforward Sequential Memory Network (DFSMN), supporting streaming VAD, non-streaming VAD, and multi-label VAD (mVAD). On the FLEURS-VAD-102 benchmark, it achieves 97.57% frame-level F1 and 99.60% AUC-ROC, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.

FireRedLID: An Encoder-Decoder LID module supporting 100+ languages and 20+ Chinese dialects and accents. On FLEURS (82 languages), it achieves 97.18% utterance-level accuracy, outperforming Whisper and SpeechBrain.

FireRedPunc: A BERT-style punctuation prediction module for Chinese and English. On multi-domain benchmarks, it achieves 78.90% average F1, outperforming FunASR-Punc (62.77%).

To advance research in speech processing, we release model weights and code at https://github.com/FireRedTeam/FireRedASR2S.
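For readers less familiar with the metrics quoted in the abstract, the following is a minimal sketch of how character error rate (CER) and frame-level F1 are conventionally computed. It is not taken from the FireRedASR2S repository: the function names are hypothetical, the toy inputs are invented for illustration, and the assumption that the "average CER" is an unweighted mean over benchmarks is ours, since the abstract does not state the averaging scheme or any text normalization.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total character edits / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def average_cer(per_benchmark_cer: Sequence[float]) -> float:
    """Unweighted mean over benchmarks (an assumption, see the lead-in)."""
    return sum(per_benchmark_cer) / len(per_benchmark_cer)


def frame_f1(ref: Sequence[int], hyp: Sequence[int]) -> float:
    """F1 over binary per-frame labels (1 = speech, 0 = non-speech)."""
    tp = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 1)
    fp = sum(1 for r, h in zip(ref, hyp) if r == 0 and h == 1)
    fn = sum(1 for r, h in zip(ref, hyp) if r == 1 and h == 0)
    if tp == 0:
        return 0.0  # no correctly detected speech frames
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    print(f"CER: {cer(['今天天气很好'], ['今天天汽很好']):.3f}")
    print(f"Frame F1: {frame_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]):.3f}")
```

Real evaluations typically normalize text (punctuation, casing, numerals) before scoring, and the choice between per-benchmark averaging and pooling all utterances can shift the headline number, so the paper's own evaluation scripts remain the reference for the reported figures.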

Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition

Voice Activity Detection

Spoken Language Identification

Punctuation Prediction

Multilingual ASR

Innovation

Methods, ideas, or system contributions that make the work stand out.

All-in-One ASR System

State-of-the-Art Performance

Multi-Dialect and Code-Switching Support