🤖 AI Summary
This work addresses a fundamental limitation of large vision-language models (VLMs): the decoupling of visual and linguistic reasoning and the absence of native "thinking with images." To this end, we propose DeepEyes, a VLM trained end-to-end via reinforcement learning to directly optimize joint vision-language decision-making, without cold-start supervised fine-tuning or external multi-model ensembles. Key innovations include tool-use-oriented data filtering and vision-grounded reward shaping, which together induce emergent, image-driven, tool-calling reasoning. Experiments demonstrate that DeepEyes achieves significant gains on fine-grained perception-and-reasoning benchmarks, with improved visual grounding, reduced hallucination, and concurrent gains in mathematical reasoning. Crucially, our analysis uncovers the dynamic evolution of human-like visual thinking, from exploratory image engagement to efficient, goal-directed utilization, revealing intrinsic developmental principles underlying vision-language reasoning.
📝 Abstract
Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they remain primarily constrained by text-based reasoning processes. Achieving a seamless integration of visual and textual reasoning that mirrors human cognition remains a significant challenge; in particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. In this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with "thinking with images" capabilities incentivized through end-to-end reinforcement learning, without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool rather than depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy that encourages successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks, and also demonstrates improvements in grounding, hallucination mitigation, and mathematical reasoning. Interestingly, we observe a distinct evolution of tool-calling behavior from initial exploration to efficient, accurate exploitation, along with diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.
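The abstract mentions a reward strategy that encourages successful tool-assisted reasoning trajectories but does not specify its form. Below is a minimal sketch of one way such a reward could be structured: a bonus for tool use that is granted only when the final answer is also correct, so the model is not rewarded for unhelpful tool calls. The function name, signature, and bonus value are illustrative assumptions, not details from the paper.

```python
def trajectory_reward(answer_correct: bool, used_tool: bool,
                      tool_bonus: float = 0.5) -> float:
    """Scalar reward for one rollout (illustrative sketch).

    answer_correct: whether the final answer matches the reference.
    used_tool: whether the trajectory contained at least one image
        tool call (e.g. cropping/zooming into a region).
    tool_bonus: extra credit for correct, tool-assisted answers;
        the value 0.5 is a placeholder, not from the paper.
    """
    # Base accuracy reward.
    reward = 1.0 if answer_correct else 0.0
    # Conditional bonus: only correct trajectories that used a tool
    # receive extra credit, discouraging gratuitous tool calls.
    if answer_correct and used_tool:
        reward += tool_bonus
    return reward
```

Under this shaping, a correct tool-assisted trajectory scores highest, a correct text-only trajectory scores less, and an incorrect trajectory scores zero regardless of tool use.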