V-Thinker: Interactive Thinking with Images

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current large multimodal models (LMMs) suffer from limited visual tool spaces and task-specific workflows, which hinder fine-grained image interaction and long-horizon reasoning. To address this, we propose a vision-centric interactive reasoning paradigm, introducing a "data evolution flywheel" that automatically generates high-quality, multi-difficulty interactive reasoning data. We further design a vision-progressive training curriculum that integrates point-level supervision with a two-stage reinforcement learning framework, enabling end-to-end optimization of perceptual alignment and interactive reasoning. Our method substantially improves both general and interactive reasoning capabilities, consistently outperforming state-of-the-art LMMs on VTBench, our newly constructed, expert-validated benchmark for visual interactive reasoning. This work establishes a new benchmark and a scalable technical pathway for image-driven interactive cognitive modeling.

📝 Abstract
Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
Problem

Research questions and friction points this paper is trying to address.

Integrating image interaction with long-horizon reasoning in multimodal models
Overcoming limited visual tool spaces and task-specific workflow designs
Advancing image-interactive reasoning through reinforcement learning frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end reinforcement learning for vision-centric thinking
Data Evolution Flywheel synthesizes interactive reasoning datasets
Visual Progressive Training Curriculum with two-stage reinforcement learning
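The progressive curriculum described above (point-level supervision first, then reward-driven refinement) can be illustrated with a toy sketch. Everything here is a simplifying assumption for illustration: a scalar stands in for the model, mean-squared error on annotated points stands in for point-level supervision, and a reward-filtered update stands in for the RL stage. This is not the paper's actual implementation.

```python
def stage1_point_supervision(w, data, lr=0.1, steps=50):
    """Toy perception alignment: fit w so that w*x matches the
    annotated point y, via gradient descent on mean-squared error."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def stage2_reward_refinement(w, data, lr=0.05, steps=50, tol=0.5):
    """Toy RL stage: a rollout earns reward +1 if its prediction is
    within tol of the target; only rewarded rollouts update w
    (a crude reward-filtered imitation scheme, assumed for illustration)."""
    for _ in range(steps):
        for x, y in data:
            pred = w * x
            reward = 1.0 if abs(pred - y) < tol else -1.0
            if reward > 0:
                w += lr * (y - pred) * x
    return w

def progressive_curriculum(data):
    """Run the two stages in order, mirroring the curriculum's structure."""
    w = 0.0
    w = stage1_point_supervision(w, data)   # align perception first
    w = stage2_reward_refinement(w, data)   # then refine with rewards
    return w
```

The ordering matters even in this toy: the reward stage only updates on near-correct rollouts, so it needs the supervised stage to bring predictions within the reward threshold first.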
Runqi Qiao
Beijing University of Posts and Telecommunications
Qiuna Tan
Beijing University of Posts and Telecommunications
Minghan Yang
Minghong Investment, Shanghai
optimization, machine learning
Guanting Dong
Renmin University of China
LLM Reasoning & Alignment, Deep Search Agent, Agentic RL
Peiqing Yang
Nanyang Technological University
Computer Vision, Image Processing, Machine Learning
Shiqiang Lang
Beijing University of Posts and Telecommunications
Enhui Wan
Beijing University of Posts and Telecommunications
Xiaowan Wang
Beijing University of Posts and Telecommunications
Yida Xu
Beijing University of Posts and Telecommunications
Lan Yang
Edwin & Florence Skinner Professor, Electrical & Systems Engineering, Washington Univ. in St Louis
resonator, laser, nonlinear optics, sensing, non-Hermitian physics
Chong Sun
Tencent WeChat
Computer Vision
Chen Li
WeChat Vision, Tencent Inc.
Honggang Zhang
Beijing University of Posts and Telecommunications