Reinforced Visual Perception with Tools

📅 2025-09-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) rely heavily on costly annotated data, generalize poorly, and struggle to jointly orchestrate multiple visual tools (e.g., OCR, object detection, image captioning) for complex visual reasoning. Method: ReVPT is an end-to-end reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that optimizes tool selection and composition directly from interactive feedback, removing the dependence on curated supervised fine-tuning data. Contribution/Results: ReVPT enables joint, policy-driven reasoning over a suite of heterogeneous visual tools and achieves state-of-the-art performance on perception-heavy benchmarks, including SAT, CV-Bench, BLINK, and MMStar. Notably, ReVPT-3B and ReVPT-7B outperform strong instruction-tuned baselines by 9.03% and 9.44% on CV-Bench, respectively, demonstrating substantial gains in cross-task generalization and complex multimodal reasoning.
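To make the tool-orchestration idea concrete, here is a minimal, illustrative sketch of the interactive loop such a system needs: the model emits tool calls in its reasoning turn, the environment executes them and feeds the observations back. The tag format, tool names, and stub implementations below are assumptions for illustration, not the authors' actual interface.

```python
import re

# Hypothetical tool registry; the paper's tools (e.g., OCR, object
# detection, image captioning) are stubbed out here as placeholders.
TOOLS = {
    "ocr": lambda image: "ocr result for " + image,
    "detect": lambda image: "boxes for " + image,
    "caption": lambda image: "caption for " + image,
}

def run_tool_calls(model_turn: str, image: str) -> list[str]:
    """Extract <tool>name</tool> calls from one model turn and return
    the tool observations to append to the next prompt."""
    calls = re.findall(r"<tool>(\w+)</tool>", model_turn)
    return [TOOLS[name](image) for name in calls if name in TOOLS]

# One reasoning turn that invokes two tools before answering.
obs = run_tool_calls(
    "Let me read the sign first. <tool>ocr</tool> <tool>detect</tool>",
    image="img_01",
)
```

In an RL setup, the reward for the final answer is what teaches the policy *which* tools to call and *when*, rather than supervised traces of tool use.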

📝 Abstract
Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various perceptual tasks, leveraging these for general visual reasoning remains challenging. Prior work demonstrates that augmenting LLMs with vision models via supervised finetuning improves performance, but faces key limitations such as expensive data generation, reliance on careful data filtering, and poor generalization. To address these issues, we propose ReVPT to enhance multi-modal LLMs' abilities to reason about and use visual tools through reinforcement learning. We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. Through extensive experiments, we show that our method achieves state-of-the-art performance on several perception-heavy benchmarks, including SAT, CV-Bench, BLINK, and MMStar, significantly outperforming the supervised and text-based RL finetuning baselines. Notably, our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench. Finally, we bring to the community new insights on RL-based visual tool-usage through extensive ablations. Our code is available at https://github.com/ls-kelvin/REVPT.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-modal LLMs' visual reasoning with tools
Overcoming limitations of supervised finetuning for visual perception
Improving generalization through reinforcement learning for tool usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for visual tool usage
GRPO-based algorithm trains multi-modal LLMs
State-of-the-art performance on perception benchmarks
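The core mechanic behind GRPO (Group Relative Policy Optimization) is computing advantages by normalizing each rollout's reward against the other rollouts sampled for the same prompt, which avoids training a separate value critic. A minimal sketch of that group-relative advantage step, independent of this paper's specific reward design:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each rollout's reward
    by the mean and (population) std of its sampled group.

    Rollouts that beat the group mean get positive advantage and are
    reinforced; below-mean rollouts are pushed down. A degenerate
    group (all rewards equal) yields zero advantage for everyone.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Four rollouts for one prompt: two correct (reward 1), two wrong (0).
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # → [1.0, -1.0, 1.0, -1.0]
```

These advantages then weight a clipped policy-gradient objective, as in PPO, but the baseline comes from the sampled group itself rather than a learned value function.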