Unified Open-World Segmentation with Multi-Modal Prompts

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary segmentation and in-context segmentation approaches suffer from architectural fragmentation, objective misalignment, and heterogeneous representation learning. Method: We propose COSINE—the first unified multimodal prompt-driven model for both tasks—leveraging joint text-image prompts to extract cross-modal features from foundation models and introducing a novel SegDecoder for fine-grained cross-modal alignment and interactive modeling, enabling pixel- to instance-level mask generation. Contribution/Results: COSINE is the first framework to unify the two dominant open-world segmentation paradigms under a single multimodal prompting architecture, facilitating synergistic bimodal enhancement. On standard benchmarks, it significantly outperforms unimodal baselines and prior dual-task methods, demonstrating that multimodal prompt fusion substantially improves generalization capability across diverse segmentation scenarios.

📝 Abstract
In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and its corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and produce masks specified by the input prompts across different granularities. In this way, COSINE overcomes the architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE achieves significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between visual and textual prompts leads to significantly improved generalization over single-modality approaches.
Problem

Research questions and friction points this paper is trying to address.

Unifying open-world segmentation with multi-modal prompts
Overcoming architectural discrepancies in segmentation pipelines
Improving generalization through visual and textual synergy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified segmentation model with multi-modal prompts
Aligns image and prompt representations via SegDecoder
Overcomes architectural discrepancies in prior segmentation methods
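The abstract describes the core mechanism only at a high level: a decoder aligns image features with text and/or visual prompt embeddings and reads out a mask conditioned on those prompts. As a purely illustrative sketch of that idea (all function names, dimensions, and the averaging fusion are assumptions, not the paper's actual SegDecoder), one could condition a per-pixel mask on fused prompt queries like this:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention.
    query: (d,); keys, values: (n, d). Returns a (d,) attended vector."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # (n,)
    weights = softmax(scores)
    return weights @ values                            # (d,)

def decode_mask(pixel_feats, text_prompt, visual_prompt):
    """Hypothetical prompt-conditioned mask readout (NOT the paper's SegDecoder).
    Each available prompt attends over pixel features; the fused query is
    dotted back against every pixel to yield mask logits.

    pixel_feats: (H, W, d) image features, e.g. from a frozen backbone.
    text_prompt, visual_prompt: (d,) prompt embeddings; either may be None.
    """
    h, w, d = pixel_feats.shape
    flat = pixel_feats.reshape(-1, d)                  # (H*W, d)
    queries = [p for p in (text_prompt, visual_prompt) if p is not None]
    # Assumed fusion: average the attended vectors across modalities.
    fused = np.mean([cross_attend(q, flat, flat) for q in queries], axis=0)
    logits = flat @ fused / np.sqrt(d)                 # (H*W,)
    return logits.reshape(h, w)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 8, 16))
text = rng.standard_normal(16)
visual = rng.standard_normal(16)
mask_logits = decode_mask(feats, text, visual)
print(mask_logits.shape)  # (8, 8)
```

Because either prompt may be omitted, the same readout handles text-only (open-vocabulary) and image-only (in-context) use, which is the unification the paper argues for; the real model learns this alignment end-to-end rather than using raw dot products.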
Yang Liu
Zhejiang University
Yufei Yin
Hangzhou Dianzi University
Chenchen Jing
Zhejiang University of Technology
Muzhi Zhu
Zhejiang University
Hao Chen
Zhejiang University
Yuling Xi
Zhejiang University
Bo Feng
Hao Wang
Apple
Shiyu Li
Apple
Chunhua Shen
Zhejiang University