🤖 AI Summary
Existing open-vocabulary segmentation and in-context segmentation approaches suffer from architectural fragmentation, objective misalignment, and heterogeneous representation learning. Method: We propose COSINE, a unified multimodal prompt-driven model for both tasks. It leverages joint text-image prompts to extract cross-modal features from foundation models and introduces a novel SegDecoder for fine-grained cross-modal alignment and interaction modeling, enabling mask generation from the pixel level to the instance level. Contribution/Results: COSINE is the first framework to unify the two dominant open-world segmentation paradigms under a single multimodal prompting architecture, enabling the two modalities to reinforce each other. On standard benchmarks, it significantly outperforms unimodal baselines and prior dual-task methods, demonstrating that multimodal prompt fusion substantially improves generalization across diverse segmentation scenarios.
📝 Abstract
In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multimodal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and its corresponding multimodal prompts, and a SegDecoder to align these representations, model their interaction, and produce masks specified by the input prompts across different granularities. In this way, COSINE overcomes the architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE achieves significant performance improvements on both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergy between visual and textual prompts leads to significantly better generalization than single-modality approaches.