Reinforced Visual Perception with Tools

📅 2025-09-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) rely heavily on costly annotated data, generalize poorly, and struggle to jointly orchestrate multiple visual tools (e.g., OCR, object detection, image captioning) for complex visual reasoning. Method: ReVPT is an end-to-end reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that optimizes tool selection and composition directly from interactive feedback, removing the dependence on curated supervised fine-tuning data. Contribution/Results: ReVPT enables joint, policy-driven reasoning over a suite of heterogeneous visual tools and achieves state-of-the-art performance on perception-heavy benchmarks, including SAT, CV-Bench, BLINK, and MMStar. Notably, ReVPT-3B and ReVPT-7B outperform strong instruction-tuned baselines by 9.03% and 9.44% on CV-Bench, respectively, demonstrating substantial gains in cross-task generalization and complex multimodal reasoning.
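To make the tool-orchestration idea concrete, here is a minimal, illustrative sketch of the interactive loop such a system needs: the model emits tool calls in its reasoning turn, the environment executes them and feeds the observations back. The tag format, tool names, and stub implementations below are assumptions for illustration, not the authors' actual interface.

```python
import re

# Hypothetical tool registry; the paper's tools (e.g., OCR, object
# detection, image captioning) are stubbed out here as placeholders.
TOOLS = {
    "ocr": lambda image: "ocr result for " + image,
    "detect": lambda image: "boxes for " + image,
    "caption": lambda image: "caption for " + image,
}

def run_tool_calls(model_turn: str, image: str) -> list[str]:
    """Extract <tool>name</tool> calls from one model turn and return
    the tool observations to append to the next prompt."""
    calls = re.findall(r"<tool>(\w+)</tool>", model_turn)
    return [TOOLS[name](image) for name in calls if name in TOOLS]

# One reasoning turn that invokes two tools before answering.
obs = run_tool_calls(
    "Let me read the sign first. <tool>ocr</tool> <tool>detect</tool>",
    image="img_01",
)
```

In an RL setup, the reward for the final answer is what teaches the policy *which* tools to call and *when*, rather than supervised traces of tool use.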

📝 Abstract
Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but faces key limitations such as expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose ReVPT to enhance multi-modal LLMs' abilities to reason about and use visual tools through reinforcement learning. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. Through extensive experiments, we show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK, and MMStar, significantly outperforming the supervised and text-based RL finetuning baselines. Notably, our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench. Finally, we bring to the community new insights on RL-based visual tool-usage through extensive ablations. Our code is available at https://github.com/ls-kelvin/REVPT.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-modal LLMs' visual reasoning with tools
Overcoming limitations of supervised finetuning for visual perception
Improving generalization through reinforcement learning for tool usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for visual tool usage
GRPO-based algorithm trains multi-modal LLMs
State-of-the-art performance on perception benchmarks
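The core mechanic behind GRPO (Group Relative Policy Optimization) is computing advantages by normalizing each rollout's reward against the other rollouts sampled for the same prompt, which avoids training a separate value critic. A minimal sketch of that group-relative advantage step, independent of this paper's specific reward design:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each rollout's reward
    by the mean and (population) std of its sampled group.

    Rollouts that beat the group mean get positive advantage and are
    reinforced; below-mean rollouts are pushed down. A degenerate
    group (all rewards equal) yields zero advantage for everyone.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Four rollouts for one prompt: two correct (reward 1), two wrong (0).
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # → [1.0, -1.0, 1.0, -1.0]
```

These advantages then weight a clipped policy-gradient objective, as in PPO, but the baseline comes from the sampled group itself rather than a learned value function.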