🤖 AI Summary
To address weak generalization and the lack of explicit reasoning in open-domain referring segmentation, this paper proposes Seg-Zero, a zero-shot reasoning segmentation framework. It employs a decoupled dual-model architecture comprising a reasoning model and a segmentation model: the reasoning model interprets the user's intent, generates chain-of-thought explanations, and produces positional prompts, which guide the segmentation model to produce pixel-level masks. Crucially, the authors train the reasoning model with GRPO-based reinforcement learning, without any explicit reasoning supervision, using a joint reward mechanism that combines format fidelity and segmentation accuracy to elicit emergent test-time reasoning. On the ReasonSeg benchmark, Seg-Zero-7B achieves a zero-shot mIoU of 57.5, surpassing LISA-7B by 18% and demonstrating substantial gains in cross-domain generalization and reasoning interpretability.
📝 Abstract
Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at https://github.com/dvlab-research/Seg-Zero.
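The abstract's joint reward, combining a format reward and an accuracy reward, can be sketched as follows. This is a minimal illustrative approximation, not the paper's exact implementation: the `<think>`/`<answer>` template, the IoU threshold of 0.5, and the equal weighting of the two terms are all assumptions made here for clarity.

```python
import re


def format_reward(response: str) -> float:
    """Reward 1.0 when the response follows a reasoning-then-answer
    template (hypothetical template; the paper may use a different one)."""
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, response, re.DOTALL) else 0.0


def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def joint_reward(response: str, pred_box, gt_box, iou_thresh: float = 0.5) -> float:
    """Sum of format and accuracy rewards; the accuracy term fires only
    when the predicted positional prompt overlaps the ground truth enough
    (threshold and weighting are illustrative assumptions)."""
    accuracy = 1.0 if iou(pred_box, gt_box) >= iou_thresh else 0.0
    return format_reward(response) + accuracy
```

In a GRPO-style loop, this scalar would score each sampled rollout, so well-formatted responses that also localize the target correctly are preferred without ever supervising the reasoning text itself.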