Qwen3-Omni Technical Report

πŸ“… 2025-09-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This paper proposes Qwen3-Omni, a unified multimodal large language model built on a Thinker-Talker Mixture-of-Experts (MoE) architecture, addressing the challenge of achieving state-of-the-art (SOTA) performance across diverse modalities without degradation relative to single-modal counterparts. Methodologically, it integrates multi-codebook discrete speech coding, block-sparse MoE layers, a lightweight causal ConvNet that replaces block-wise diffusion (yielding a theoretical end-to-end first-packet latency of 234 ms), and an explicit cross-modal reasoning (Thinking) model. The model supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages, enabling streaming text-to-speech and joint multimodal inference. Evaluated on 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 and overall SOTA on 22, surpassing closed-source systems including Gemini-2.5-Pro and GPT-4o-Transcribe. The authors publicly release the 30B-A3B model series and a high-performance audio captioning model.
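
The block-sparse MoE layers mentioned above route each token through only a few experts, so most parameters stay inactive per token (the "30B-A3B" naming suggests roughly 3B of the 30B parameters are active, though the report is the authority on the exact figure). Below is a minimal top-k routing sketch in PyTorch; the dimensions, expert count, and dense routing loop are illustrative assumptions, not details from the report, and production kernels would use block-sparse computation instead of the explicit loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Feed-forward MoE layer: each token is processed by only k of n experts."""
    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network scores experts per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)    # keep the k highest-scoring experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # dense loops for clarity only;
            for e, expert in enumerate(self.experts): # block-sparse kernels do this efficiently
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
y = moe(torch.randn(16, 1024))   # 16 tokens in, 16 tokens out; only 2 of 8 experts run per token
```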

πŸ“ Abstract
We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
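
The abstract states that Talker autoregressively predicts discrete speech codecs with a multi-codebook scheme, so each frame's codes can be handed to the waveform decoder as soon as they are produced. The sketch below illustrates per-frame multi-codebook prediction; the codebook count, vocabulary size, and greedy decoding are assumptions for illustration, not details from the report.

```python
import torch
import torch.nn as nn

class MultiCodebookHead(nn.Module):
    """Predicts one discrete code per codebook for the current speech frame."""
    def __init__(self, d_model=1024, n_codebooks=4, codebook_size=1024):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)]
        )

    def forward(self, h):                              # h: (batch, d_model) Talker hidden state
        logits = [head(h) for head in self.heads]      # one classifier per codebook
        codes = [l.argmax(dim=-1) for l in logits]     # greedy pick; sampling is also possible
        return torch.stack(codes, dim=-1)              # (batch, n_codebooks) codes for this frame
```

In a streaming loop, the codes emitted for frame t go straight to the causal waveform decoder, which is what makes synthesis possible from the first codec frame rather than after a full block.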
Problem

Research questions and friction points this paper is trying to address.

Developing a single multimodal model that maintains state-of-the-art performance across text, image, audio, and video without degradation relative to single-modal counterparts
Reducing first-packet latency in streaming speech synthesis for real-time applications
Enhancing multimodal reasoning capabilities across different input modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Thinker-Talker MoE architecture unifying multimodal perception and generation
Multi-codebook scheme for autoregressive prediction of discrete speech codecs
Lightweight causal ConvNet replacing block-wise diffusion for streaming synthesis (see the sketch below)
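
As a rough illustration of the last bullet, the sketch below shows a causal 1-D convolutional decoder that turns codec-frame embeddings into waveform chunks using only past context, which is what allows audio to be emitted from the first codec frame. Layer widths, kernel sizes, and the samples-per-frame factor are assumptions for illustration, not the report's actual decoder design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Conv1d):
    """Conv1d that pads only on the left, so frame t never sees frames > t."""
    def forward(self, x):
        pad = (self.kernel_size[0] - 1) * self.dilation[0]
        return super().forward(F.pad(x, (pad, 0)))

class StreamingCodecDecoder(nn.Module):
    """Maps codec-frame embeddings to waveform chunks, one chunk per frame."""
    def __init__(self, d_code=256, hidden=512, samples_per_frame=240):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(d_code, hidden, kernel_size=3),
            nn.GELU(),
            CausalConv1d(hidden, hidden, kernel_size=3, dilation=2),
            nn.GELU(),
            CausalConv1d(hidden, samples_per_frame, kernel_size=3),
        )

    def forward(self, code_emb):          # code_emb: (batch, d_code, frames)
        chunks = self.net(code_emb)       # (batch, samples_per_frame, frames)
        # Interleave frames back into a waveform: (batch, frames * samples_per_frame)
        return chunks.transpose(1, 2).reshape(code_emb.size(0), -1)
```

Because every layer is causal, the decoder can emit the chunk for frame t the moment that frame's codes arrive, avoiding the block-wise buffering a diffusion-based decoder would require.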