🤖 AI Summary
This work addresses the limitations of existing multilingual automatic speech recognition (ASR) and phoneme alignment systems in real-world scenarios, particularly regarding accuracy, efficiency, and generalization. We propose the Qwen3-ASR model family, comprising two end-to-end multilingual ASR models and a non-autoregressive forced alignment model, which together support speech recognition in 52 languages and dialects and precise timestamp alignment for 11 languages. Built upon the Qwen3-Omni foundation model, our approach integrates large language model–based audio understanding into a lightweight ASR framework, leveraging a non-autoregressive architecture, large-scale multilingual data, joint language–speech modeling, and high-concurrency inference optimization. Qwen3-ASR-1.7B achieves state-of-the-art performance among open-source models, rivaling leading commercial APIs, while Qwen3-ASR-0.6B attains an average first-token latency of 92 ms and transcribes 2000 seconds of audio per second at 128-way concurrency; the alignment model surpasses the current best methods in accuracy, efficiency, and multilingual coverage.
📝 Abstract
In this report, we introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both leverage large-scale speech training data and the strong audio understanding ability of their foundation model, Qwen3-Omni. In addition to open-source benchmarks, we conduct comprehensive internal evaluations, since ASR models may differ little in open-source benchmark scores yet exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-source ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy–efficiency trade-off. Qwen3-ASR-0.6B achieves an average TTFT as low as 92 ms and transcribes 2000 seconds of speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM-based non-autoregressive (NAR) timestamp predictor that can align text–speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest forced alignment models while offering further advantages in efficiency and versatility. To further accelerate community research on ASR and audio understanding, we release these models under the Apache 2.0 license.