🤖 AI Summary
This work addresses the difficulty multimodal large language models have in acquiring tool-use capabilities for complex visual reasoning tasks. The authors propose ToolsRL (Tool-supervised Reinforcement Learning), a reinforcement learning framework with a two-stage training curriculum. The first stage optimizes basic tool-calling skills using only tool-specific rewards; the second stage trains with accuracy-targeted rewards while still allowing tool calls. Separating the stages avoids optimization conflict between the heterogeneous objectives of learning to call tools and solving the task. ToolsRL relies on simple, native, interpretable visual tools (zoom-in, rotate, flip, and draw point/line) whose supervision signals are easy to collect. Experiments indicate that the tool-supervised curriculum is efficient and that ToolsRL achieves strong tool-use capabilities on complex visual reasoning tasks.
📝 Abstract
In this paper, we investigate how Multimodal Large Language Models (MLLMs) can effectively master tool use to solve complex visual reasoning tasks. To this end, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, which uses direct tool supervision for more effective tool-use learning. We focus on a set of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, for which tool supervision is easy to collect. We develop a reinforcement learning curriculum in which the first stage is optimized solely by a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while still allowing tool calls. In this way, the model masters tool calling before using tools to complete visual reasoning tasks, which avoids potential optimization conflicts among these heterogeneous tasks. Our experiments show that the tool-supervised curriculum training is efficient and that ToolsRL achieves strong tool-use capabilities on complex visual reasoning tasks.
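To make the two-stage curriculum concrete, the following is a minimal sketch of how a reward could switch between stages. It is an illustration only, not the authors' implementation: the abstract does not specify the reward definitions, so the IoU-based zoom reward, the exact-match accuracy reward, and the names `Rollout`, `box_iou`, and `curriculum_reward` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (x1, y1, x2, y2) in pixel coordinates; a hypothetical box format.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Rollout:
    """One sampled model response (hypothetical structure)."""
    zoom_box: Optional[Box]  # region the model asked to zoom into, if any
    answer: str              # final answer text


def curriculum_reward(rollout: Rollout, target_box: Box,
                      target_answer: str, stage: int) -> float:
    """Stage 1 scores only the tool call; stage 2 scores only the answer."""
    if stage == 1:
        # Assumed tool-specific reward: overlap between the requested zoom
        # region and an annotated target region. No tool call earns nothing.
        if rollout.zoom_box is None:
            return 0.0
        return box_iou(rollout.zoom_box, target_box)
    if stage == 2:
        # Assumed accuracy-targeted reward: exact match on the final answer.
        # Tool calls are still allowed but are not scored directly.
        same = rollout.answer.strip().lower() == target_answer.strip().lower()
        return 1.0 if same else 0.0
    raise ValueError(f"unknown curriculum stage: {stage}")
```

Under these assumptions, a stage-1 rollout is rewarded purely for where it zooms, while a stage-2 rollout with a poor zoom region can still earn full reward if its answer is correct, which mirrors the separation of objectives the abstract describes.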