Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of directly applying GRPO (Group Relative Policy Optimization) training—originally designed for reasoning tasks in language models—to vision-language models for perceptual tasks such as image segmentation. Recognizing the fundamental differences between perception- and reasoning-oriented objectives, the authors propose a plug-and-play GRPO framework that requires no architectural modifications. By introducing a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, the method enhances output space coverage and stabilizes reward signals. This approach is the first to explicitly delineate the critical distinctions between perception-driven and reasoning-driven training paradigms, effectively balancing output diversity with fine-grained reward consistency. Experiments demonstrate significant improvements in segmentation performance on complex visual scenes while maintaining strong generalization capabilities.
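Since the framework builds on GRPO, it helps to recall the group-relative advantage at GRPO's core: instead of a learned value baseline, each sampled response is scored against the statistics of a group of responses for the same prompt. The sketch below is a generic illustration of that standardization step, not the paper's implementation; the reward values and group size are invented for the example (e.g., IoU-style segmentation rewards).

```python
# Generic sketch of GRPO's group-relative advantage (not Dr. Seg's code).
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize a group of sampled-output rewards to zero mean, unit std.

    GRPO samples G responses per prompt and uses the group's own mean
    and standard deviation as the baseline, so no value network is needed.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: G = 4 sampled segmentations with IoU-based rewards.
rewards = [0.2, 0.5, 0.8, 0.5]
print([round(a, 3) for a in group_relative_advantages(rewards)])
```

The paper's Distribution-Ranked Reward module modifies how the raw rewards feeding this step are computed and ranked; the standardization itself is the standard GRPO recipe.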

📝 Abstract
Following the success of Group Relative Policy Optimization (GRPO) in foundation LLMs, an increasing number of works have sought to adapt GRPO to Visual Large Language Models (VLLMs) for visual perception tasks (e.g., detection and segmentation). However, much of this line of research rests on a long-standing yet unexamined assumption: training paradigms developed for language reasoning can be transferred seamlessly to visual perception. Our experiments show that this assumption is not valid, revealing intrinsic differences between reasoning-oriented and perception-oriented settings. Using reasoning segmentation as a representative case, we surface two overlooked factors: (i) the need for a broader output space, and (ii) the importance of fine-grained, stable rewards. Building on these observations, we propose Dr. Seg, a simple, plug-and-play GRPO-based framework consisting of a Look-to-Confirm mechanism and a Distribution-Ranked Reward module, requiring no architectural modifications and integrating seamlessly with existing GRPO-based VLLMs. Extensive experiments demonstrate that Dr. Seg improves performance in complex visual scenarios while maintaining strong generalization. Code and models will be available at https://github.com/xVI-group-SCU/Dr-Seg.
Problem

Research questions and friction points this paper is trying to address.

Visual Large Language Models
GRPO
visual perception
reasoning segmentation
reward design
Innovation

Methods, ideas, or system contributions that make the work stand out.

GRPO
Visual Large Language Models
Perception-Oriented Design
Reasoning Segmentation
Distribution-Ranked Reward