ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model

📅 2024-11-29

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing open-vocabulary segmentation models rely on predefined category prompts and produce only sparse predictions, which limits free-text-driven pixel-level mask generation and automatic discovery of unseen categories. To address this, we propose ROSE, a large multimodal model (LMM) built on a patch-wise perception paradigm for open-set dense segmentation: each image patch is treated as a region-of-interest candidate, so dense and sparse masks are predicted jointly. An instruction-response fine-tuning mechanism lets the model generate category names itself instead of selecting from a closed set. A conversation-based refinement step then combines the previous prediction with a textual prompt to revise mask detail and category labels. Across various segmentation tasks, ROSE achieves competitive performance within a single unified framework.

📝 Abstract

Advances in CLIP and large multimodal models (LMMs) have enabled open-vocabulary and free-text segmentation, yet existing models still require predefined category prompts, limiting free-form category self-generation. Most segmentation LMMs also remain confined to sparse predictions, restricting their applicability in open-set environments. In contrast, we propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation through patch-wise perception. Our method treats each image patch as an independent region-of-interest candidate, enabling the model to predict both dense and sparse masks simultaneously. Additionally, a newly designed instruction-response paradigm takes full advantage of the generation and generalization capabilities of LMMs, achieving category prediction independent of closed-set constraints or predefined categories. To further enhance mask detail and category precision, we introduce a conversation-based refinement paradigm that integrates the prediction result from the previous step with a textual prompt for revision. Extensive experiments demonstrate that ROSE achieves competitive performance across various segmentation tasks in a unified framework. Code will be released.
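
The patch-wise idea in the abstract can be illustrated with a toy sketch. This is not ROSE's implementation: the function names (`patch_grid`, `dense_mask_from_patch_labels`, `sparse_mask`) are hypothetical, and the per-patch labels are hand-written placeholders standing in for whatever free-text categories a model would generate. It only shows how one label per patch yields a dense label map, from which a sparse (single-category) mask can be read off.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def patch_grid(height: int, width: int, patch: int) -> List[Box]:
    """Return one box per patch, in row-major order (hypothetical helper)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return [
        (r, c, r + patch, c + patch)
        for r in range(0, height, patch)
        for c in range(0, width, patch)
    ]


def dense_mask_from_patch_labels(
    labels: List[str], height: int, width: int, patch: int
) -> List[List[str]]:
    """Paint each patch's label over its pixels to form a dense label map."""
    boxes = patch_grid(height, width, patch)
    if len(labels) != len(boxes):
        raise ValueError("need exactly one label per patch")
    mask = [[""] * width for _ in range(height)]
    for label, (top, left, bottom, right) in zip(labels, boxes):
        for r in range(top, bottom):
            for c in range(left, right):
                mask[r][c] = label
    return mask


def sparse_mask(dense: List[List[str]], target: str) -> List[List[int]]:
    """Binary mask for a single category, read off the dense label map."""
    return [[1 if v == target else 0 for v in row] for row in dense]


# Placeholder labels (assumed, not model output): a 4x4 image split into 2x2 patches.
labels = ["sky", "sky", "grass", "dog"]
dense = dense_mask_from_patch_labels(labels, height=4, width=4, patch=2)
dog = sparse_mask(dense, "dog")
```

Here every pixel receives a label (the dense case), while `dog` marks only the bottom-right patch (the sparse case), which is the sense in which both kinds of mask come from the same per-patch predictions.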

Problem

Research questions and friction points this paper is trying to address.

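Most segmentation LMMs are confined to sparse predictions and cannot produce dense masks in open-set environments.

Existing segmentation models still require predefined category prompts, which prevents free-form category generation.

Mask detail and category precision need further improvement after the initial prediction.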

Innovation

Methods, ideas, or system contributions that make the work stand out.

Patch-wise perception enables dense mask prediction.

Instruction-response paradigm enhances category generalization.

Conversation-based refinement improves mask detail and category precision.
