Qwen2.5-VL Technical Report

📅 2025-02-19
🤖 AI Summary
To address the limitations of vision-language models in spatial-scale perception, temporal dynamics modeling, and interactive task execution, this paper introduces Qwen2.5-VL, a large vision-language model that operates natively at dynamic resolution. Methodologically, it pioneers a dynamic-resolution Vision Transformer (ViT) with absolute time encoding, enabling native processing of arbitrary image resolutions and hour-long videos, and integrates window attention, multi-stage alignment training, and structured output generation for joint spatiotemporal understanding. The model achieves significant improvements in high-precision object localization (bounding boxes and points), second-level event localization in long videos, and structural parsing of complex documents and charts, while supporting cross-modal reasoning and interaction with real devices. Qwen2.5-VL-72B matches GPT-4o and Claude 3.5 Sonnet on document and chart understanding, and the model family covers deployment scenarios from edge devices to high-performance computing, delivering state-of-the-art visual comprehension alongside Qwen2.5-level language capabilities.
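
To make the dynamic-resolution idea concrete, here is a minimal sketch of a native-resolution resize rule in the spirit of Qwen2.5-VL's preprocessing: height and width are rounded to multiples of the merged patch size while staying within a pixel budget, so the visual token count tracks the image's actual area instead of a fixed normalized size. The constants (14 px patches merged 2x2 into 28 px units, and the budget bounds) follow the open-source Qwen2.5-VL code but are assumptions here, not a definitive implementation.

```python
import math

# Hedged sketch of dynamic-resolution preprocessing in the spirit of
# Qwen2.5-VL: images keep their native aspect ratio, and height/width
# are rounded to multiples of the merged patch size (14px patches,
# merged 2x2 -> 28px units). The pixel-budget bounds are assumptions.
PATCH = 14
MERGE = 2
UNIT = PATCH * MERGE             # 28px per merged visual token
MIN_PIXELS = 56 * 56             # assumed lower bound on the pixel budget
MAX_PIXELS = 14 * 14 * 4 * 1280  # assumed upper bound

def smart_resize(h: int, w: int) -> tuple[int, int]:
    """Round (h, w) to multiples of UNIT while staying inside the
    [MIN_PIXELS, MAX_PIXELS] budget and roughly preserving aspect ratio."""
    h_bar = max(UNIT, round(h / UNIT) * UNIT)
    w_bar = max(UNIT, round(w / UNIT) * UNIT)
    if h_bar * w_bar > MAX_PIXELS:
        scale = math.sqrt(h * w / MAX_PIXELS)
        h_bar = math.floor(h / scale / UNIT) * UNIT
        w_bar = math.floor(w / scale / UNIT) * UNIT
    elif h_bar * w_bar < MIN_PIXELS:
        scale = math.sqrt(MIN_PIXELS / (h * w))
        h_bar = math.ceil(h * scale / UNIT) * UNIT
        w_bar = math.ceil(w * scale / UNIT) * UNIT
    return h_bar, w_bar

if __name__ == "__main__":
    h, w = smart_resize(1080, 1920)
    print(h, w, "->", (h // UNIT) * (w // UNIT), "visual tokens")
```

Because the token count is derived from the resized grid rather than a fixed canvas, a small icon and a full-page scan simply produce different sequence lengths, which is what "natively perceiving spatial scale" amounts to in practice.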

📝 Abstract
We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we reduce computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
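
A point from the abstract worth unpacking is absolute time encoding: the temporal position IDs of sampled video frames are derived from their real timestamps rather than their frame indices, so ID spacing reflects elapsed time even when frames are sampled at varying FPS. Below is a minimal sketch of that mapping; the IDs-per-second granularity is an illustrative assumption, not the paper's exact constant.

```python
# Hedged sketch of absolute time encoding: instead of consecutive frame
# indices, each sampled frame's temporal position ID is derived from its
# real timestamp, so ID gaps stay proportional to elapsed time regardless
# of the sampling FPS. The granularity constant is assumed for illustration.
TEMPORAL_IDS_PER_SECOND = 2  # assumed granularity (0.5 s per ID step)

def temporal_position_ids(timestamps_s: list[float]) -> list[int]:
    """Map frame timestamps (seconds) to temporal position IDs whose
    spacing reflects absolute time rather than frame order."""
    return [round(t * TEMPORAL_IDS_PER_SECOND) for t in timestamps_s]

# Two clips covering the same 4-second span at different FPS get
# consistent temporal IDs, which is what enables second-level event
# localization in hour-long videos.
print(temporal_position_ids([0.0, 2.0, 4.0]))            # sparse -> [0, 4, 8]
print(temporal_position_ids([0.0, 1.0, 2.0, 3.0, 4.0]))  # dense  -> [0, 2, 4, 6, 8]
```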
Problem

Research questions and friction points this paper is trying to address.

Enhancing visual recognition and document parsing for multimodal understanding
Enabling precise object localization and structured data extraction from documents (see the grounding sketch after this list)
Processing dynamic-resolution images and long videos with temporal-spatial awareness
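
On the localization point above, Qwen2.5-VL grounds objects in absolute pixel coordinates of the input image rather than normalized ones, typically returned as JSON. A minimal parsing sketch follows; the key names ("bbox_2d", "label") mirror the open-source examples and should be treated as assumptions, and the response shown is fabricated for illustration.

```python
import json

# Hedged sketch of parsing a grounding response. Qwen2.5-VL emits
# absolute pixel coordinates (no 0-1000 normalization); the JSON key
# names follow the open-source examples and are an assumption here.
response = """[
  {"bbox_2d": [112, 64, 480, 410], "label": "invoice total field"},
  {"bbox_2d": [30, 500, 220, 560], "label": "signature"}
]"""

for det in json.loads(response):
    x1, y1, x2, y2 = det["bbox_2d"]  # absolute pixels in the input image
    print(f'{det["label"]}: box=({x1},{y1})-({x2},{y2})')
```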
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic resolution processing with absolute time encoding
Native dynamic-resolution ViT trained with Window Attention (sketched after this list)
Interactive visual agent for reasoning and tool usage
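
For the window-attention item above, here is a minimal sketch of attention restricted to non-overlapping windows over a ViT patch grid, which keeps compute roughly linear in image area rather than quadratic; the paper uses 112x112 px windows (8x8 patches of 14 px) with full attention retained in only four layers. The head count and tensor layout here are illustrative assumptions, and learned Q/K/V projections are omitted for brevity.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of window attention over a ViT patch grid: attention is
# computed inside non-overlapping windows, so cost grows linearly with
# image area instead of quadratically with token count.
def window_attention(x: torch.Tensor, h: int, w: int,
                     win: int = 8, heads: int = 4) -> torch.Tensor:
    """x: (h*w, dim) patch tokens on an h-by-w grid; h, w divisible by win."""
    n, dim = x.shape
    hd = dim // heads
    # Regroup tokens into (num_windows, win*win, dim).
    x = x.view(h // win, win, w // win, win, dim)
    x = x.permute(0, 2, 1, 3, 4).reshape(-1, win * win, dim)
    # Identity Q/K/V for brevity (a real block applies learned projections).
    q = k = v = x.view(x.size(0), win * win, heads, hd).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v)
    out = out.transpose(1, 2).reshape(-1, win * win, dim)
    # Undo the window partition back to the (h*w, dim) token sequence.
    out = out.view(h // win, w // win, win, win, dim)
    return out.permute(0, 2, 1, 3, 4).reshape(n, dim)

tokens = torch.randn(32 * 32, 64)              # 32x32 patch grid, dim 64
print(window_attention(tokens, 32, 32).shape)  # torch.Size([1024, 64])
```

Keeping a few full-attention layers alongside windowed ones is a common trade-off: windows bound the cost at native resolution, while the full layers let information propagate globally across the image.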