Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

📅 2026-05-19

🤖 AI Summary
This work addresses automatic speech recognition in complex real-world environments, where compound acoustic distortions often cause omissions and hallucinations. The authors propose a unified in-the-wild speech recognition framework built on two contributions: Voices-in-the-Wild-2M, a large-scale dataset built by scalable simulation of compound acoustic distortions, and a training recipe that pairs acoustic-to-semantic progressive supervised fine-tuning with dual-granularity WER-gated policy optimization. On adverse-condition benchmarks such as VOiCES and NOIZEUS, the method outperforms prior state-of-the-art systems, and on compound-distortion scenarios it achieves over 30% relative word error rate reduction.
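To make "compound acoustic distortion" concrete, here is a minimal sketch that chains two common degradations: synthetic reverberation followed by additive noise at a target signal-to-noise ratio. It does not reproduce the authors' simulation pipeline. The function names, the exponential-decay impulse response, and the parameters `rt60` and `snr_db` are all assumptions made for illustration, and the summary does not say which phenomena or parameter ranges Voices-in-the-Wild-2M actually uses.

```python
import numpy as np


def add_reverb(speech, sample_rate, rt60=0.4, seed=0):
    """Convolve speech with a synthetic impulse response (hypothetical model).

    The impulse response is white noise shaped by an exponential envelope
    that decays by 60 dB after `rt60` seconds.
    """
    rng = np.random.default_rng(seed)
    n = max(1, int(rt60 * sample_rate))
    t = np.arange(n) / sample_rate
    ir = rng.standard_normal(n) * 10.0 ** (-3.0 * t / rt60)
    ir[0] = 1.0                       # direct path
    ir /= np.sqrt(np.sum(ir ** 2))    # unit energy, keeps level comparable
    return np.convolve(speech, ir)[: len(speech)]


def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech so the result has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)  # loop or trim noise to length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


def compound_distortion(speech, noise, sample_rate, snr_db, rt60):
    """Apply reverb first, then noise: one possible compound scenario."""
    reverberant = add_reverb(speech, sample_rate, rt60=rt60)
    return add_noise_at_snr(reverberant, noise, snr_db)
```

Applying reverberation before noise mirrors a room in which the speaker's voice reflects off surfaces and ambient noise is then picked up at the microphone. Other orderings or additional stages would model different scenarios.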
📝 Abstract
Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01% on VOiCES R4-B-F, and 21.49% vs. 29.34% on NOIZEUS Sta-0). On complex compositional acoustic scenarios, Mega-ASR further delivers over 30% relative WER reduction against strong open- and closed-source baselines, establishing a scalable paradigm for robust ASR in-the-wild.
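The results above are stated in terms of word error rate (WER) and relative WER reduction. The sketch below shows the standard definitions: WER as word-level edit distance divided by reference length, and relative reduction as the fractional drop from a baseline. The function names are illustrative, and reading the paired benchmark percentages as WER values (lower is better) is an assumption based on context. This is not the authors' evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion (omission)
                            curr[j - 1] + 1,      # insertion (hallucination)
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer, system_wer):
    """Fraction by which system_wer improves on baseline_wer."""
    return (baseline_wer - system_wer) / baseline_wer


# Figures quoted in the abstract, assumed here to be WER percentages.
voices_gain = relative_wer_reduction(54.01, 45.69)
noizeus_gain = relative_wer_reduction(29.34, 21.49)
```

Under that assumption, the two quoted benchmark pairs correspond to relative reductions of roughly 15.4% (VOiCES R4-B-F) and 26.8% (NOIZEUS Sta-0). These are separate from the over-30% figure, which the abstract reports for complex compositional scenarios.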
Problem

Research questions and friction points this paper is trying to address.

acoustic robustness
in-the-wild speech recognition
compositional distortions
automatic speech recognition
real-world environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

in-the-wild ASR
acoustic robustness
compound acoustic simulation
progressive supervised fine-tuning
WER-gated optimization