🤖 AI Summary
Existing open-source multimodal large language models (MLLMs) suffer from blind visual tool invocation, leading to excessive inference overhead and degraded performance. To address this, we propose AdaTooler-V, a framework for adaptive tool invocation. Our method introduces AT-GRPO, a sample-level reward-scaling reinforcement learning algorithm that dynamically adjusts rewards based on per-sample tool utility; it further contributes the first verifiable RL training datasets covering single-image, multi-image, and video modalities, and combines multimodal chain-of-thought reasoning, explicit tool-call decision modeling, and a hybrid supervised fine-tuning (SFT) cold start followed by RL training. Evaluated across 12 benchmarks, AdaTooler-V achieves state-of-the-art performance: AdaTooler-V-7B attains 89.8% accuracy on the high-resolution V* benchmark, substantially outperforming GPT-4o and Gemini 1.5 Pro.
📝 Abstract
Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for the SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, which outperforms existing methods on diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the proprietary models GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
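The abstract does not give AT-GRPO's exact formula, but the stated idea of scaling rewards by a per-sample Tool Benefit Score can be sketched as follows. Everything here is illustrative: the function names (`tool_benefit_score`, `scaled_reward`), the definition of the benefit score as an accuracy gain, and the scale constants are assumptions, not the paper's actual implementation.

```python
def tool_benefit_score(acc_with_tool: float, acc_without_tool: float) -> float:
    """Hypothetical Tool Benefit Score: accuracy gain from invoking the tool
    on this sample (positive means the tool genuinely helps)."""
    return acc_with_tool - acc_without_tool


def scaled_reward(base_reward: float, used_tool: bool, tbs: float,
                  scale_up: float = 1.5, scale_down: float = 0.5) -> float:
    """Scale a sample's reward so tool calls are reinforced only when useful.

    If the tool helps (tbs > 0), a rollout that invoked the tool is rewarded
    more; an unnecessary tool call on a sample where the tool does not help
    is down-weighted, and vice versa for tool-free rollouts. The scale
    factors (1.5 / 0.5) are placeholder values, not from the paper.
    """
    beneficial = tbs > 0
    if used_tool:
        return base_reward * (scale_up if beneficial else scale_down)
    return base_reward * (scale_up if not beneficial else scale_down)
```

Under this sketch, a correct answer that invoked a tool on a sample where the tool raises accuracy gets its reward amplified, while the same tool call on a sample that needs no tool is penalized, pushing the policy toward adaptive rather than blind invocation.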