VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models often employ excessive visual tokens, leading to computational redundancy; while token compression preserves accuracy on general VQA tasks, it severely degrades performance on fine-grained tasks like OCR. To address this, we propose VisionThink—a reinforcement learning–based framework for dynamic resolution adaptation. The model autonomously decides whether to upsample image resolution for the current task, guided by a fine-grained reward signal generated by an LLM-as-Judge mechanism that evaluates task-specific correctness. We introduce a dedicated request token and a custom reward-penalty scheme to enable intelligent, demand-driven visual token compression. Experiments demonstrate that VisionThink maintains state-of-the-art accuracy across diverse multimodal benchmarks—including zero degradation in OCR accuracy—while reducing average visual token count by 42% and significantly lowering inference overhead.
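The decision loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `run_vlm`, `downsample`, and the `RESIZE_TOKEN` name are all hypothetical stand-ins for the real model call, the 1/4-resolution input, and the dedicated request token.

```python
# Hypothetical sketch of VisionThink-style dynamic-resolution inference.
RESIZE_TOKEN = "<request_high_res>"  # token name assumed, not from the paper

def run_vlm(image_tokens, question):
    """Toy stand-in for the VLM: ask for high resolution on OCR-style
    questions when the visual input looks too coarse; otherwise answer."""
    if "read" in question.lower() and len(image_tokens) < 512:
        return RESIZE_TOKEN
    return f"answer based on {len(image_tokens)} tokens"

def downsample(image_tokens, factor=4):
    """Keep every `factor`-th visual token (stand-in for the 1/4-resolution image)."""
    return image_tokens[::factor]

def visionthink_infer(image_tokens, question):
    """Start from the downsampled image; re-run at full resolution only
    if the model emits the special request token. Returns (answer, token cost)."""
    low_res = downsample(image_tokens)
    out = run_vlm(low_res, question)
    if out == RESIZE_TOKEN:  # model judged low resolution insufficient
        return run_vlm(image_tokens, question), len(low_res) + len(image_tokens)
    return out, len(low_res)

tokens = list(range(1024))
ans_easy, cost_easy = visionthink_infer(tokens, "What color is the car?")
ans_ocr, cost_ocr = visionthink_infer(tokens, "Read the text on the sign.")
```

In this toy setting the general question is answered from 256 tokens, while the OCR-style question triggers the request token and pays for both passes; the per-sample decision is what yields the average token savings reported above.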

📝 Abstract
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
Problem

Research questions and friction points this paper is trying to address.

Dynamic visual token compression for efficient vision-language models
Smart resolution adjustment based on task complexity
Balancing performance and efficiency in OCR and VQA tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic resolution processing for visual tokens
Reinforcement learning with LLM-as-Judge strategy
Smart token compression via case-by-case decisions
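The reward-penalty scheme behind the RL training can be sketched as follows. This is an illustrative sketch only: the judge score, penalty constant, and function signature are assumptions, standing in for the paper's LLM-as-Judge signal and its carefully designed penalty for maintaining a stable resize call ratio.

```python
# Hypothetical sketch of an LLM-as-Judge reward with a resize penalty.
# judge_score stands in for the LLM-judged correctness signal in [0, 1];
# resize_penalty is an illustrative constant, not the paper's value.
def visionthink_reward(judge_score, used_high_res, high_res_was_needed,
                       resize_penalty=0.1):
    """Reward task correctness, but penalize high-resolution requests
    that were not needed, steering the model toward demand-driven
    token compression."""
    reward = judge_score
    if used_high_res and not high_res_was_needed:
        reward -= resize_penalty  # discourage wasteful upsampling
    return reward
```

A correct answer obtained without an unnecessary resize keeps the full judge score, while a correct answer that needlessly requested the full image is docked by the penalty, which is what pushes the resize call ratio toward only the cases that truly need it.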