Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning

📅 2025-06-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the weak generalization of large multimodal models (LMMs) on pixel-level understanding tasks such as camouflaged object detection (COD) and salient object detection (SOD), which prior methods typically tackle with text supervision or extensive architectural modifications. The authors propose a purely reinforcement learning (RL)-driven prompt generation method that eliminates both requirements. Specifically, they introduce Group Relative Policy Optimization (GRPO) to segmentation, training a policy network that autonomously generates point or box prompts to guide SAM2. Supervision is limited to image-mask pairs; no textual annotations or architectural changes are needed. The approach enables open-world zero-shot transfer: it achieves a 0.873 S-measure on COD10K, 71.4% cIoU on RefCOCOg, and 56.7% gIoU on ReasonSeg, surpassing fully supervised baselines.
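Because supervision comes only from image-mask pairs, the natural per-prompt reward is mask quality: the policy's sampled prompt is fed to SAM2, and the resulting mask is scored against the ground truth. The sketch below shows one plausible reward, intersection-over-union, with NumPy; the paper may use a different or composite reward, so treat `iou_reward` as an illustrative assumption rather than the paper's exact formulation.

```python
import numpy as np

def iou_reward(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Score one sampled prompt: IoU between the mask SAM2 returns for
    that prompt and the ground-truth mask (the only supervision signal).

    Both inputs are binary HxW arrays; returns a scalar in [0, 1].
    """
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # both masks empty: define reward as 0
        return 0.0
    inter = np.logical_and(pred, gt).sum()
    return float(inter / union)
```

A higher-reward prompt is one whose induced SAM2 mask overlaps the annotation more tightly, so the policy is pushed toward prompts that localize the (possibly camouflaged) foreground object.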

📝 Abstract
We present Seg-R1, a preliminary exploration of using reinforcement learning (RL) to enhance the pixel-level understanding and reasoning capabilities of large multimodal models (LMMs). Starting with foreground segmentation tasks, specifically camouflaged object detection (COD) and salient object detection (SOD), our approach enables the LMM to generate point and bounding box prompts in the next-token fashion, which are then used to guide SAM2 in producing segmentation masks. We introduce Group Relative Policy Optimization (GRPO) into the segmentation domain, equipping the LMM with pixel-level comprehension through a carefully designed training strategy. Notably, Seg-R1 achieves remarkable performance with purely RL-based training, achieving 0.873 S-measure on COD10K without complex model modification. Moreover, we found that pure RL training demonstrates strong open-world generalization. Despite being trained solely on foreground segmentation image-mask pairs without text supervision, Seg-R1 achieves impressive zero-shot performance on referring segmentation and reasoning segmentation tasks, with 71.4 cIoU on RefCOCOg test and 56.7 gIoU on ReasonSeg test, outperforming models fully supervised on these datasets.
Problem

Research questions and friction points this paper is trying to address.

Enhancing pixel-level understanding in large multimodal models using RL
Improving segmentation tasks like camouflaged and salient object detection
Achieving open-world generalization with pure RL-based training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for pixel-level understanding
Group Relative Policy Optimization in segmentation
Pure RL training enables open-world generalization
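The GRPO ingredient named above is critic-free: for each image the policy samples a group of prompt completions, and each completion's reward is normalized against the group's own mean and standard deviation to form its advantage. A minimal sketch of that group-relative normalization (following the general GRPO recipe, not code from this paper; `eps` is an illustrative stabilizer):

```python
import numpy as np

def grpo_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Group Relative Policy Optimization advantage estimate.

    `rewards` holds the scalar reward (e.g. mask IoU) of each sampled
    completion for the SAME input. Each completion's advantage is its
    reward standardized within the group, so no learned value/critic
    network is needed.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Completions that beat their group's average get positive advantages and are reinforced; below-average ones are suppressed, which is what lets mask quality alone steer the prompt policy.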