Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Existing vision-language models (VLMs) lack self-reflection and error-correction capabilities in multi-turn visual reasoning. Method: We propose a verifiable, self-reflective multi-step visual reasoning framework that actively invokes external tools for image analysis, rather than passively perceiving inputs, and employs a redundancy-penalized reinforcement learning (RL) strategy to encourage multi-scale exploration and trajectory-level self-correction. We further construct a challenging, answer-verifiable multi-turn visual question-answering dataset. The approach combines high-resolution image inputs, cold-start supervised fine-tuning (SFT), and redundancy-aware RL to support iterative tool invocation and assessment of reasoning trajectories. Contribution/Results: Experiments demonstrate significant improvements in multi-step reasoning accuracy, robustness, and self-correction capability across multiple visual understanding benchmarks, establishing a new paradigm for trustworthy visual reasoning.

📝 Abstract
Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Enables deep, reliable multi-turn reasoning with images
Addresses self-reflection and correction in visual reasoning trajectories
Improves performance on complex visual understanding benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-turn tool calls for complex visual tasks
Redundancy-penalized RL for self-reflective reasoning
Cold-start SFT with high-difficulty verifiable data
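To make the redundancy-penalized idea concrete, here is a minimal sketch (not the paper's implementation; the function name, trajectory format, and weights are all hypothetical) of a trajectory-level reward that discounts repeated tool calls and penalizes wrong answers reached without multi-scale exploration:

```python
# Hedged sketch in the spirit of DRIM's RL stage, NOT the paper's actual
# reward. A trajectory is modeled as a list of (region, scale) tool calls;
# re-inspecting the same region at the same scale counts as redundant,
# and wrong answers reached at a single scale incur an extra penalty.

def trajectory_reward(tool_calls, answer_correct,
                      redundancy_weight=0.2, exploration_penalty=0.1):
    """tool_calls: list of (region, scale) tuples (hypothetical format)."""
    seen = set()
    redundant = 0
    for call in tool_calls:
        if call in seen:
            redundant += 1  # same region at same scale inspected again
        seen.add(call)
    scales_explored = len({scale for _, scale in tool_calls})

    reward = 1.0 if answer_correct else 0.0
    reward -= redundancy_weight * redundant
    if not answer_correct and scales_explored < 2:
        # failure without multi-scale exploration is penalized further
        reward -= exploration_penalty
    return reward
```

Under this shaping, a correct answer with no repeated calls scores 1.0, while an incorrect answer that only ever looked at one scale scores below zero, nudging the policy toward exploring before committing.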
Authors
Wenhao Yang
National Key Laboratory for Novel Software Technology, Nanjing University, China
Yu Xia
AI Business, Alibaba Group
Jinlong Huang
AI Business, Alibaba Group
Shiyin Lu
Alibaba Group
Qing-Guo Chen
Alibaba Group
Zhao Xu
AI Business, Alibaba Group
Weihua Luo
Alibaba
Kaifu Zhang
Assistant Professor of Marketing, Carnegie Mellon University
Yuanyu Wan
Zhejiang University
Lijun Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China