Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data

📅 2025-06-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the limited robustness of multilingual automatic speech recognition (ASR) across accents, speaking styles, and challenging acoustic conditions, this paper introduces Whale, a large-scale end-to-end multilingual ASR model. Methodologically, Whale integrates w2v-BERT self-supervised representations with an E-Branchformer encoder and a joint CTC-attention decoding strategy, combining unsupervised pretraining with supervised fine-tuning. The model is trained on a mixed multilingual speech corpus that includes a large in-house dataset. Experiments show that Whale achieves a 2.4% WER on LibriSpeech test-clean and a 3.4% CER on CSJ eval3, outperforming Whisper large-v3 and OWSM v3.1 on those sets. These results indicate substantial improvements in robustness and generalization across multilingual ASR scenarios.

📝 Abstract
This paper reports on the development of a large-scale speech recognition model, Whale. Like models such as Whisper and OWSM, Whale leverages both a large model size and a diverse, extensive dataset. Whale's architecture integrates the w2v-BERT self-supervised model, an encoder-decoder backbone built on E-Branchformer, and a joint CTC-attention decoding strategy. The training corpus comprises varied speech data drawn not only from public corpora but also from in-house data, enhancing the model's robustness to different speaking styles and acoustic conditions. In evaluations on multiple benchmarks, Whale achieved performance comparable to existing models. In particular, it achieves a word error rate of 2.4% on the LibriSpeech test-clean set and a character error rate of 3.4% on the CSJ eval3 set, outperforming Whisper large-v3 and OWSM v3.1.
Problem

Research questions and friction points this paper is trying to address.

Develop a large-scale multilingual ASR model
Improve robustness to diverse speaking styles and acoustic conditions
Achieve competitive performance on standard benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses the w2v-BERT self-supervised model for speech representations
Integrates an E-Branchformer-based encoder-decoder backbone
Employs a joint CTC-attention decoding strategy
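The joint CTC-attention decoding listed above is commonly implemented as a log-linear interpolation of the two model scores when ranking hypotheses. A minimal sketch is shown below; the weight of 0.3 and the toy log-probabilities are illustrative assumptions, not values from the paper:

```python
def joint_ctc_attention_score(ctc_logprob: float,
                              attn_logprob: float,
                              ctc_weight: float = 0.3) -> float:
    """Interpolate CTC and attention log-probabilities for one hypothesis:

        score(Y|X) = w * log p_ctc(Y|X) + (1 - w) * log p_att(Y|X)
    """
    return ctc_weight * ctc_logprob + (1.0 - ctc_weight) * attn_logprob

# Toy candidate hypotheses with (ctc_logprob, attn_logprob) pairs.
hypotheses = {
    "the cat sat": (-4.2, -3.1),
    "the cat sad": (-6.5, -3.0),
    "a cat sat":   (-5.0, -4.8),
}

# Pick the hypothesis with the highest joint score.
best = max(hypotheses, key=lambda h: joint_ctc_attention_score(*hypotheses[h]))
```

In practice the CTC branch penalizes hypotheses that are misaligned with the acoustics while the attention decoder supplies stronger language modeling; the interpolation weight balances the two.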