DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit insufficient fine-grained visual perception in complex real-world scenarios—such as densely crowded scenes—limiting their ability to resolve ambiguous or occluded regions. Method: We propose DIP-R1, a reinforcement learning–based deep inspection and perception framework. It introduces a novel triple-regularized reward mechanism: (i) stepwise reasoning reward, (ii) variance-guided gaze reward, and (iii) weighted precision-recall reward—jointly driving active inspection of ambiguous regions and uncertainty-aware modeling. DIP-R1 integrates proximal policy optimization (PPO), MLLMs, variance-driven attention guidance, and task-specific reward modeling. Results: Extensive experiments demonstrate that DIP-R1 significantly outperforms supervised fine-tuning and state-of-the-art baselines across diverse in-domain and out-of-domain fine-grained object detection benchmarks, validating its strong generalization capability and robustness under distributional shift.
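The stepwise reasoning reward above is described as rule-based: the model is rewarded for emitting the three stages (reasoning, observing, decision-making) in order. A minimal sketch of such a format check follows; the `<reasoning>`/`<observing>`/`<decision>` tag names are assumptions for illustration, not the paper's actual output format.

```python
import re

# Hypothetical stage tags -- the paper's exact output format is not specified here.
STAGE_TAGS = ["reasoning", "observing", "decision"]

def stepwise_reasoning_reward(output: str) -> float:
    """Return 1.0 if all three stages appear, each once, in the prescribed
    order; otherwise 0.0 (a binary rule-based format reward)."""
    positions = []
    for tag in STAGE_TAGS:
        match = re.search(rf"<{tag}>.*?</{tag}>", output, re.DOTALL)
        if match is None:
            return 0.0  # missing stage -> no reward
        positions.append(match.start())
    # Stages must occur in the order reasoning -> observing -> decision.
    return 1.0 if positions == sorted(positions) else 0.0
```

A well-formed rollout such as `<reasoning>…</reasoning><observing>…</observing><decision>…</decision>` would score 1.0, while an out-of-order or incomplete one scores 0.0.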

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant visual understanding capabilities, yet their fine-grained visual perception in complex real-world scenarios, such as densely crowded public areas, remains limited. Inspired by the recent success of reinforcement learning (RL) in both LLMs and MLLMs, we explore in this paper how RL can enhance the visual perception ability of MLLMs. We then develop a novel RL-based framework, Deep Inspection and Perception with RL (DIP-R1), designed to enhance the visual perception capabilities of MLLMs by comprehending complex scenes and looking through visual instances closely. DIP-R1 guides MLLMs through detailed inspection of visual scenes via three simply designed rule-based reward models. First, we adopt a standard reasoning reward that encourages the model to include three step-by-step processes: 1) reasoning to understand the visual scene, 2) observing to look through regions of interest that remain ambiguous, and 3) decision-making to predict the answer. Second, a variance-guided looking reward is designed to examine uncertain regions during the second, observing process. It explicitly enables the model to inspect ambiguous areas, improving its ability to mitigate perceptual uncertainty. Third, we model a weighted precision-recall accuracy reward that promotes accurate decision-making. We explore the framework's effectiveness across diverse fine-grained object detection data covering challenging real-world environments, such as densely crowded scenes. Built upon existing MLLMs, DIP-R1 achieves consistent and significant improvements across various in-domain and out-of-domain scenarios, and it outperforms various existing baseline models and supervised fine-tuning methods. Our findings highlight the substantial potential of integrating RL into MLLMs for enhancing capabilities in complex real-world perception tasks.
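The weighted precision-recall accuracy reward rewards detections that are both precise and complete. A hedged sketch of one plausible formulation follows, using greedy IoU matching between predicted and ground-truth boxes and an F-beta-style combination; the `iou_thr` and `beta` parameters and the matching scheme are assumptions, not the paper's exact design.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def weighted_pr_reward(pred, gt, iou_thr=0.5, beta=1.0):
    """F-beta style reward over greedily matched boxes.

    beta > 1 weights recall higher (punishes missed objects in crowded
    scenes); beta < 1 weights precision higher. Both are assumptions.
    """
    if not pred or not gt:
        return 0.0
    matched_gt = set()
    tp = 0
    for p in pred:
        # Greedy one-to-one matching: best unmatched ground-truth box.
        best, best_j = 0.0, -1
        for j, g in enumerate(gt):
            if j in matched_gt:
                continue
            v = iou(p, g)
            if v > best:
                best, best_j = v, j
        if best >= iou_thr:
            tp += 1
            matched_gt.add(best_j)
    precision = tp / len(pred)
    recall = tp / len(gt)
    if precision + recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, a prediction set that exactly matches the ground truth yields a reward of 1.0, while spurious or missed boxes pull the reward toward 0 through the precision and recall terms respectively.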
Problem

Research questions and friction points this paper is trying to address.

Enhancing fine-grained visual perception in complex scenes
Improving MLLMs' ability to inspect ambiguous regions
Boosting decision-making accuracy in real-world environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

RL-based framework enhances MLLMs' perception
Rule-based rewards guide detailed scene inspection
Improves accuracy in complex crowded scenes