🤖 AI Summary
Existing text-prompted image segmentation methods lack explicit chain-of-thought (CoT) reasoning, resulting in limited generalization to unseen prompts and out-of-domain scenarios. To address this, we propose the first end-to-end unified reinforcement learning framework that jointly optimizes CoT reasoning and mask generation. Built upon Qwen2.5-VL-3B-Instruct, our method introduces a multi-granularity reward mechanism—incorporating sentence-level semantics, bounding-box-level localization, and pixel-level segmentation cues—to enable synergistic optimization of semantic understanding and mask prediction. This design significantly enhances model interpretability and cross-prompt/cross-domain generalization. Evaluated on RefCOCO, RefCOCO+, and RefCOCOg, our approach achieves a mean cIoU of 81.2%, outperforming the strong baseline GLaMM by 5.6 percentage points. The results validate both the effectiveness and state-of-the-art performance of our framework.
📝 Abstract
Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at https://github.com/hustvl/LENS.