LENS: Learning to Segment Anything with Unified Reinforced Reasoning

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-prompted image segmentation methods lack explicit chain-of-thought (CoT) reasoning, which limits their generalization to unseen prompts and out-of-domain scenarios. To address this, the authors propose LENS, an end-to-end unified reinforcement-learning framework that jointly optimizes CoT reasoning and mask generation. Built on Qwen2.5-VL-3B-Instruct, the method introduces a multi-granularity reward mechanism, combining sentence-level semantics, bounding-box-level localization, and pixel-level segmentation cues, so that semantic understanding and mask prediction are optimized together. This design improves interpretability and cross-prompt, cross-domain generalization. On RefCOCO, RefCOCO+, and RefCOCOg, LENS achieves a mean cIoU of 81.2%, outperforming the strong GLaMM baseline by up to 5.6 points.
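The reward design above can be pictured as a weighted sum of three measurable terms. The sketch below is a minimal illustration, assuming simple IoU-based box and mask terms and a precomputed sentence-level score; the weights (`w_sent`, `w_box`, `w_mask`) and function names are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def mask_iou(pred, gt):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + 1e-8)

def multi_granularity_reward(sent_score, pred_box, gt_box, pred_mask, gt_mask,
                             w_sent=1.0, w_box=1.0, w_mask=1.0):
    """Weighted sum of sentence-, box-, and pixel-level reward terms.

    sent_score: scalar in [0, 1] scoring the generated CoT rationale
    (e.g. a semantic-similarity or format check); the scorer is a
    placeholder here, and the paper's exact term may differ.
    """
    return (w_sent * sent_score
            + w_box * box_iou(pred_box, gt_box)
            + w_mask * mask_iou(pred_mask, gt_mask))
```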

📝 Abstract
Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at https://github.com/hustvl/LENS.
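For reference, the cIoU (cumulative IoU) metric reported above is commonly computed on RefCOCO-style benchmarks by accumulating intersections and unions over the entire split before dividing, rather than averaging per-image IoU. A minimal sketch, assuming boolean mask arrays:

```python
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """cIoU as commonly defined in referring segmentation:
    sum all intersections and unions over the dataset, then divide,
    so larger objects weigh more than under mean per-image IoU."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        total_inter += np.logical_and(pred, gt).sum()
        total_union += np.logical_or(pred, gt).sum()
    return total_inter / max(total_union, 1)
```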
Problem

Research questions and friction points this paper is trying to address.

Improving generalization in text-prompted image segmentation
Addressing limited chain-of-thought reasoning in segmentation
Enhancing segmentation quality across unseen domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning jointly optimizes reasoning and segmentation (see the sketch after this list)
Unified rewards combine sentence, box, and segment cues
End-to-end framework generates chain-of-thought rationales
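The abstract does not spell out the RL algorithm. A common recipe for reward-based fine-tuning of vision-language models is a GRPO-style objective, in which several (rationale, mask) rollouts are sampled per prompt and their rewards are normalized within the group; the sketch below illustrates that generic recipe under this assumption and should not be read as the paper's confirmed training loop.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rollout rewards within a sampled group (GRPO-style):
    each (rationale, mask) rollout for the same image/prompt gets an
    advantage equal to its reward minus the group mean, divided by
    the group standard deviation."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical usage: rewards from a multi-granularity scorer,
# later used to weight token log-probabilities in the policy loss.
print(group_relative_advantages([0.92, 0.41, 0.73, 0.18]))
```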
👥 Authors
Lianghui Zhu
School of EIC, Huazhong University of Science & Technology
Bin Ouyang
School of EIC, Huazhong University of Science & Technology
Yuxuan Zhang
School of EIC, Huazhong University of Science & Technology
Tianheng Cheng
ByteDance Seed
Computer Vision, Object Detection, Instance Segmentation, Multimodal Models, Autonomous Driving
Rui Hu
School of EIC, Huazhong University of Science & Technology
Haocheng Shen
vivo AI Lab
MR Brain Imaging, Medical Image Analysis, Computer Vision, Machine Learning, Deep Learning
Longjin Ran
vivo AI Lab
Xiaoxin Chen
Coriell Institute for Medical Research
Li Yu
School of EIC, Huazhong University of Science & Technology
Wenyu Liu
School of EIC, Huazhong University of Science & Technology
Xinggang Wang
Professor, Huazhong University of Science and Technology
Artificial Intelligence, Computer Vision, Autonomous Driving, Object Detection, Object Segmentation