🤖 AI Summary
Existing open-vocabulary segmentation and in-context segmentation approaches suffer from architectural fragmentation, objective misalignment, and heterogeneous representation learning. Method: We propose COSINE, a unified multimodal prompt-driven model for both tasks. It leverages joint text-image prompts to extract cross-modal features from foundation models and introduces a novel SegDecoder for fine-grained cross-modal alignment and interaction modeling, enabling mask generation from the pixel level to the instance level. Contribution/Results: COSINE is the first framework to unify the two dominant open-world segmentation paradigms under a single multimodal prompting architecture, enabling the two modalities to reinforce each other. On standard benchmarks, it significantly outperforms unimodal baselines and prior dual-task methods, demonstrating that multimodal prompt fusion substantially improves generalization across diverse segmentation scenarios.
📝 Abstract
In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multimodal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and its corresponding multimodal prompts, and a SegDecoder to align these representations, model their interaction, and produce masks specified by the input prompts across different granularities. In this way, COSINE overcomes the architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE achieves significant performance improvements on both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergy between visual and textual prompts leads to significantly better generalization than single-modality approaches.