Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning

📅 2025-06-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the weak generalization of large multimodal models (LMMs) on pixel-level understanding tasks such as camouflaged object detection (COD) and salient object detection (SOD), which prior methods typically tackle with text supervision or extensive architectural modifications. The authors propose a purely reinforcement learning (RL)-driven prompt generation method that eliminates both requirements. Specifically, they introduce Group Relative Policy Optimization (GRPO) to segmentation, training a policy network that autonomously generates point or box prompts to guide SAM2. Supervision is limited to image-mask pairs; no textual annotations or architectural changes are needed. The approach enables open-world zero-shot transfer: it achieves a 0.873 S-measure on COD10K, 71.4% cIoU on RefCOCOg, and 56.7% gIoU on ReasonSeg, surpassing fully supervised baselines.
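Because supervision comes only from image-mask pairs, the natural per-prompt reward is mask quality: the policy's sampled prompt is fed to SAM2, and the resulting mask is scored against the ground truth. The sketch below shows one plausible reward, intersection-over-union, with NumPy; the paper may use a different or composite reward, so treat `iou_reward` as an illustrative assumption rather than the paper's exact formulation.

```python
import numpy as np

def iou_reward(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Score one sampled prompt: IoU between the mask SAM2 returns for
    that prompt and the ground-truth mask (the only supervision signal).

    Both inputs are binary HxW arrays; returns a scalar in [0, 1].
    """
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: define reward as 0
        return 0.0
    inter = np.logical_and(pred, gt).sum()
    return float(inter / union)
```

A higher-reward prompt is one whose induced SAM2 mask overlaps the annotation more tightly, so the policy is pushed toward prompts that localize the (possibly camouflaged) foreground object.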

📝 Abstract
We present Seg-R1, a preliminary exploration of using reinforcement learning (RL) to enhance the pixel-level understanding and reasoning capabilities of large multimodal models (LMMs). Starting with foreground segmentation tasks, specifically camouflaged object detection (COD) and salient object detection (SOD), our approach enables the LMM to generate point and bounding box prompts in the next-token fashion, which are then used to guide SAM2 in producing segmentation masks. We introduce Group Relative Policy Optimization (GRPO) into the segmentation domain, equipping the LMM with pixel-level comprehension through a carefully designed training strategy. Notably, Seg-R1 achieves remarkable performance with purely RL-based training, achieving 0.873 S-measure on COD10K without complex model modification. Moreover, we found that pure RL training demonstrates strong open-world generalization. Despite being trained solely on foreground segmentation image-mask pairs without text supervision, Seg-R1 achieves impressive zero-shot performance on referring segmentation and reasoning segmentation tasks, with 71.4 cIoU on RefCOCOg test and 56.7 gIoU on ReasonSeg test, outperforming models fully supervised on these datasets.
Problem

Research questions and friction points this paper is trying to address.

Enhancing pixel-level understanding in large multimodal models using RL
Improving segmentation tasks like camouflaged and salient object detection
Achieving open-world generalization with pure RL-based training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for pixel-level understanding
Group Relative Policy Optimization in segmentation
Pure RL training enables open-world generalization
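The GRPO ingredient named above is critic-free: for each image the policy samples a group of prompt completions, and each completion's reward is normalized against the group's own mean and standard deviation to form its advantage. A minimal sketch of that group-relative normalization (following the general GRPO recipe, not code from this paper; `eps` is an illustrative stabilizer):

```python
import numpy as np

def grpo_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Group Relative Policy Optimization advantage estimate.

    `rewards` holds the scalar reward (e.g. mask IoU) of each sampled
    completion for the SAME input. Each completion's advantage is its
    reward standardized within the group, so no learned value/critic
    network is needed.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Completions that beat their group's average get positive advantages and are reinforced; below-average ones are suppressed, which is what lets mask quality alone steer the prompt policy.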