🤖 AI Summary
This work addresses the ambiguity surrounding performance gains in vision-language models trained with tool-augmented reinforcement learning: whether improvements stem from better tool use or from enhanced intrinsic capabilities. To disentangle these factors, we propose MED (Measure-Explain-Diagnose), the first systematic framework to isolate tool-induced effects from intrinsic learning contributions. Through fine-grained, checkpoint-level evaluations across multiple vision-language models and benchmarks, we find that the observed gains primarily arise from strengthened intrinsic reasoning. Crucially, reinforcement learning with tools mainly reduces invocation errors and interference from tool schemata, rather than substantially improving tool mastery or correcting fundamental failures in the model's internal reasoning.
📝 Abstract
Vision tool-use reinforcement learning (RL) can equip vision-language models (VLMs) with visual operators such as crop-and-zoom and achieve strong performance gains, yet it remains unclear whether these gains are driven by improved tool use or by evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool-schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, current vision tool-use RL learns to coexist safely with tools rather than to master them.
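For intuition, one way to make the gain-harm decomposition concrete (a sketch with illustrative notation; the paper's exact definitions may differ): comparing the same checkpoint with and without tool calls on a benchmark, the tool-induced accuracy difference splits exactly into the questions the tool flips from wrong to right (gain) and those it flips from right to wrong (harm):

$$
\Delta_{\text{tool}} \;=\; \mathrm{Acc}_{\text{tool}} - \mathrm{Acc}_{\text{no-tool}}
\;=\; \underbrace{\Pr\big[\text{no-tool wrong},\ \text{tool right}\big]}_{\text{gain}}
\;-\; \underbrace{\Pr\big[\text{no-tool right},\ \text{tool wrong}\big]}_{\text{harm}}
$$

Read this way, the finding that tool-use RL mainly reduces tool-induced harm means the second term shrinks over training, while the first term (tool-based correction of intrinsic failures) grows little.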