🤖 AI Summary
This work addresses the ambiguity surrounding performance gains in vision-language models trained with tool-augmented reinforcement learning: whether improvements stem from better tool use or from enhanced intrinsic capabilities. To disentangle these factors, we propose MED (Measure-Explain-Diagnose), the first systematic framework to isolate tool-induced effects from intrinsic learning contributions. Through fine-grained, checkpoint-level evaluations across multiple vision-language models and benchmarks, we find that the observed gains primarily arise from strengthened intrinsic reasoning. Crucially, reinforcement learning with tools mainly reduces invocation errors and interference from tool schemata, rather than substantially improving tool mastery or correcting fundamental failures in the model's internal reasoning.
📝 Abstract
Vision tool-use reinforcement learning (RL) can equip vision-language models (VLMs) with visual operators such as crop-and-zoom and achieve strong performance gains, yet it remains unclear whether these gains are driven by improved tool use or by evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool-schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, current vision tool-use RL learns to coexist safely with tools rather than to master them.
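For intuition, one way to make the gain-harm decomposition concrete (a sketch with illustrative notation; the paper's exact definitions may differ): comparing the same checkpoint with and without tool calls on a benchmark, the tool-induced accuracy difference splits exactly into the questions the tool flips from wrong to right (gain) and those it flips from right to wrong (harm):

$$
\Delta_{\text{tool}} \;=\; \mathrm{Acc}_{\text{tool}} - \mathrm{Acc}_{\text{no-tool}}
\;=\; \underbrace{\Pr\big[\text{no-tool wrong},\ \text{tool right}\big]}_{\text{gain}}
\;-\; \underbrace{\Pr\big[\text{no-tool right},\ \text{tool wrong}\big]}_{\text{harm}}
$$

Read this way, the finding that tool-use RL mainly reduces tool-induced harm means the second term shrinks over training, while the first term (tool-based correction of intrinsic failures) grows little.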