Thinking with Programming Vision: Towards a Unified View for Thinking with Images

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current multimodal large language models (MLLMs) face three key bottlenecks in image tool invocation: (1) reliance on closed, static toolsets; (2) severe sensitivity to image orientation and natural corruptions (e.g., rotation, degradation); and (3) limited fault tolerance and multi-step coordination capability. To address these, we propose CodeVision—a framework that uses code generation as a universal interface for dynamically invoking arbitrary image processing tools, enabling robust and scalable multi-step visual reasoning. Our contributions include: (i) the first systematic analysis revealing MLLMs’ pronounced sensitivity to image orientation changes; (ii) a “code-as-tool” paradigm supporting flexible tool composition, runtime error recovery, and feedback integration; and (iii) a dense process reward mechanism coupled with a two-stage training pipeline (supervised fine-tuning followed by reinforcement learning). Evaluated on Qwen2.5-VL and Qwen3-VL, CodeVision significantly improves robustness, chain execution efficiency, and multi-tool coordination, validated on a novel benchmark.

📝 Abstract
Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
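The abstract describes a dense process reward that encourages strategic and efficient tool use during RL. The paper's exact reward design is not given here, but as a hypothetical illustration, such a reward might combine small per-step bonuses for tool calls that execute cleanly, penalties for runtime errors and overly long chains, and a terminal reward for a correct final answer:

```python
# Hypothetical dense process reward (illustration only; the paper's actual
# reward function is not specified on this page). It rewards clean tool
# executions, penalizes runtime errors and long chains, and adds a terminal
# reward for a correct final answer.

def dense_process_reward(step_outcomes, final_correct,
                         step_bonus=0.1, error_penalty=0.2,
                         terminal_reward=1.0, length_penalty=0.02):
    """step_outcomes: list of booleans, True iff that tool call ran cleanly."""
    r = sum(step_bonus if ok else -error_penalty for ok in step_outcomes)
    r -= length_penalty * len(step_outcomes)  # prefer short, efficient chains
    r += terminal_reward if final_correct else 0.0
    return r

# A short, error-free chain scores higher than a longer, error-laden one,
# even when both reach the correct answer.
good = dense_process_reward([True, True], final_correct=True)
bad = dense_process_reward([True, False, True, True], final_correct=True)
assert good > bad
```

All parameter names and weights above are invented for the sketch; the key property it demonstrates is that the reward is *dense* (every step contributes) rather than outcome-only.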
Problem

Research questions and friction points this paper is trying to address.

MLLMs lack robust tool-based reasoning for visual inputs under simple corruptions
Current approaches use narrow tool sets with limited real-world scalability
Existing methods lack flexible code interfaces for universal image operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Code-as-tool framework for universal image operations
Two-stage training with SFT and dense RL rewards
New datasets and benchmark for robustness evaluation
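The core of the code-as-tool idea is that the model emits executable code instead of calling entries in a fixed tool registry, and runtime errors become feedback for recovery. A minimal sketch of such a loop (hypothetical, not the authors' implementation; the toy "image" is a nested list standing in for real image data):

```python
# Hypothetical code-as-tool execution loop: the model emits a Python snippet
# that transforms the current image, an executor runs it, and any runtime
# error is returned as text so the model can revise its code (error recovery).
import traceback

def execute_tool_code(code: str, image):
    """Run model-generated code with `image` in scope; it should set `result`.

    Returns (result_or_original_image, error_message_or_None).
    """
    namespace = {"image": image}
    try:
        exec(code, namespace)
        return namespace.get("result", image), None
    except Exception:
        # The traceback is fed back to the model as an observation.
        return image, traceback.format_exc(limit=1)

# A toy 2x3 "image" as a nested list of pixel values.
img = [[1, 2, 3],
       [4, 5, 6]]

# Model-emitted snippet: rotate the image 90 degrees clockwise.
snippet = "result = [list(row) for row in zip(*image[::-1])]"
rotated, err = execute_tool_code(snippet, img)
assert err is None and rotated == [[4, 1], [5, 2], [6, 3]]

# A buggy snippet yields textual feedback instead of crashing the loop.
bad = "result = image.rotate(90)"  # lists have no .rotate method
_, err = execute_tool_code(bad, img)
assert err is not None and "AttributeError" in err
```

Because any operation expressible in code is available, this interface scales beyond a closed toolset; the error-feedback path is what enables the runtime error recovery highlighted in the summary.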
Zirun Guo
Zhejiang University, ByteDance, BandAI
Minjie Hong
Zhejiang University
Multi-modal Learning · LLM · Reinforcement Learning · Generative Retrieval · Recommendation
Feng Zhang
ByteDance, BandAI
Kai Jia
MIT
Tao Jin
Zhejiang University