Qwen3-VL Technical Report

📅 2025-11-26
🤖 AI Summary
This work introduces Qwen3-VL—the most capable vision-language model in the Qwen series—designed to unify text understanding, ultra-long-context modeling (up to 256K tokens), and multimodal reasoning across single/multiple images and video. Methodologically, it proposes an enhanced interleaved MRoPE positional encoding, a DeepStack cross-modal fusion architecture, and a fine-grained temporal alignment mechanism; integrates dense and Mixture-of-Experts (MoE) components; and employs multi-level ViT feature fusion with text-guided temporal modeling. Experiments demonstrate state-of-the-art performance on major multimodal benchmarks—including MMMU and MathVista—while significantly advancing long-document parsing, high-precision video temporal localization, and cross-modal referencing. Qwen3-VL establishes a robust foundation for multimodal agents, visual reasoning, and multimodal code generation.

📝 Abstract
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
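The first architectural upgrade, interleaved-MRoPE, extends rotary position embedding to the temporal, height, and width axes of visual inputs. The sketch below illustrates one plausible reading of "interleaved": rotary frequency slots are assigned round-robin across the three axes, so every axis covers both high and low frequencies. This is a minimal illustration under assumed conventions (the function name, axis order, and frequency base are not taken from the paper).

```python
import numpy as np

def interleaved_mrope_angles(t, h, w, head_dim=8, base=10000.0):
    """Sketch of interleaved multi-axis RoPE: frequency slots are assigned
    round-robin to the (temporal, height, width) axes, so each axis sees
    both fast- and slow-rotating components. Returns the rotation angles
    for one position (t, h, w); head_dim/2 angles, one per 2-dim pair."""
    n_freq = head_dim // 2
    inv_freq = base ** (-np.arange(n_freq) / n_freq)   # standard RoPE decay
    pos = np.array([t, h, w], dtype=np.float64)
    axis = np.arange(n_freq) % 3                       # slot i -> axis i % 3
    return pos[axis] * inv_freq
```

Contrast this with a blocked layout, where the temporal axis would take a contiguous band of (say) the slowest frequencies; interleaving avoids starving any axis of a frequency range.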
Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal understanding across text, images, and video inputs
Improving long-context comprehension for documents and videos up to 256K tokens
Advancing reasoning capabilities for visual, mathematical, and cross-modal tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhanced interleaved-MRoPE for spatial-temporal modeling
DeepStack integration for vision-language alignment
Text-based time alignment for precise video grounding
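The third upgrade, text-based time alignment, replaces positional-encoding-only timing (T-RoPE) with explicit textual timestamps in the token stream. A minimal sketch of the idea is shown below; the timestamp format and function name are hypothetical, not the paper's specification.

```python
def interleave_timestamps(frame_tokens, fps=2.0):
    """Prefix each sampled video frame's tokens with an explicit textual
    timestamp so the language model can ground events to absolute time.
    The "<x.x seconds>" string format is an illustrative assumption."""
    out = []
    for i, tokens in enumerate(frame_tokens):
        out.append(f"<{i / fps:.1f} seconds>")  # hypothetical timestamp text
        out.extend(tokens)
    return out
```

Because the timestamps live in text space rather than only in positional phases, the model can quote them directly when asked "when does X happen", which is the kind of precise temporal grounding the abstract describes.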
Authors

Shuai Bai · Qwen Team, Alibaba Group · Multi-Modal Learning, Visual Generation
Yuxuan Cai · Qwen Team
Ruizhe Chen · Zhejiang University · LLM, MLLM
Keqin Chen · Beihang University · Large Language Model, Multimodal
Xionghui Chen · Qwen Team
Zesen Cheng · Peking University · MLLM, Video LLM, Visual Grounding, Image/Video Segmentation
Lianghao Deng · Qwen Team
Wei Ding · Qwen Team
Chang Gao · Qwen Team
Chunjiang Ge · Qwen Team
Wenbin Ge · Qwen Team
Zhifang Guo · Institute of Computing Technology, Chinese Academy of Sciences · MultiModal, Speech, Sound, NLP
Qidong Huang · Qwen Team, Alibaba Cloud · vision and language
Jie Huang · Qwen Team
Fei Huang · Qwen Team
Binyuan Hui · Qwen Team, Alibaba Group · Large Language Models, Code LLMs, Reasoning, Agent
Shutong Jiang · Qwen Team
Zhaohai Li · Qwen Team
Mingsheng Li · Bowling Green State University · Corporate finance, dividend policy, mutual funds, ETFs
Mei Li · Qwen Team
Kaixin Li · National University of Singapore · Machine Learning, Natural Language Processing, Code Intelligence, GUI Agents
Zicheng Lin · Qwen Team
Junyang Lin · Qwen Team, Alibaba Group & Peking University · Natural Language Processing, Cross-Modal Representation Learning, Pretraining
Xuejing Liu · ICT, UCAS · computer vision
Jiawei Liu · Qwen Team
Chenglong Liu · Qwen Team
Yang Liu · Qwen Team
Dayiheng Liu · Qwen Team
Shixuan Liu · National University of Defense Technology · Knowledge Reasoning, Domain Generalization, Causal Inference, Data Engineering
Dunjie Lu · Sun Yat-sen University (B.Sc. Computer Science) · AI, ML, LLM, VLM, Agent
Ruilin Luo · Tsinghua University · LLM Reasoning, Graph Learning
Chenxu Lv · Qwen Team
Rui Men · Qwen Team, Alibaba Group & Peking University · NLP
Lingchen Meng · Qwen Team, Alibaba Group; Fudan University · Large Multimodal Models
Xuancheng Ren · Qwen Team
Xingzhang Ren · Qwen Team
Sibo Song · Alibaba · computer vision, deep learning, multimodal learning
Yuchong Sun · Renmin University of China · Vision-Language
Jun Tang · Qwen Team
Jianhong Tu · Qwen Team
Jianqiang Wan · Alibaba Group
Peng Wang · Qwen Team
Pengfei Wang · Qwen Team
Qiuyue Wang · School of Information, Renmin University of China · information extraction, knowledge graph, knowledge reasoning
Yuxuan Wang · Qwen Team
Tianbao Xie · University of Hong Kong · Artificial Intelligence, Deep Learning, Natural Language Processing
Yiheng Xu · University of Hong Kong · Natural Language Processing
Haiyang Xu · Qwen Team
Jin Xu · Qwen Team
Zhibo Yang · Qwen Team
Mingkun Yang · Qwen Team
Jianxin Yang · Qwen Team
An Yang · Qwen Team, Peking University · Natural Language Processing (NLP)
Bowen Yu · Qwen Team, Alibaba Group · Post-training, Foundation Model
Fei Zhang · Shanghai Jiao Tong University · Machine Learning, Computer Vision
Hang Zhang · Qwen Team
Xi Zhang · Qwen Team
Bo Zheng · Qwen Team
Humen Zhong · Qwen Team
Jingren Zhou · Alibaba Group, Microsoft · Cloud Computing, Large Scale Distributed Systems, Machine Learning, Query Processing
Fan Zhou · Qwen Team
Jing Zhou · Qwen Team
Yuanzhi Zhu · Qwen Team
Ke Zhu · Qwen Team