MARS: a Multimodal Alignment and Ranking System for Few-Shot Segmentation

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing few-shot segmentation methods select masks using only visual similarity, which leads to suboptimal ranking. To address this, we propose a plug-and-play multimodal alignment and ranking framework that can be integrated with a variety of mask proposal systems. Its key component is a multimodal scoring mechanism computed at two granularities (local and global), which fuses complementary cues from text, vision, geometry, and semantics for robust mask scoring, selection, and weighted fusion. Experiments on four benchmarks (COCO-20i, Pascal-5i, LVIS-92i, and FSS-1000) show that adding the framework to top-performing methods yields new state-of-the-art results on multiple benchmarks. Code will be released upon acceptance.

📝 Abstract
Current few-shot segmentation literature lacks a mask selection method that goes beyond visual similarity between the query and example images, leading to suboptimal predictions. We present MARS, a plug-and-play ranking system that leverages multimodal cues to filter and merge mask proposals robustly. Starting from a set of mask predictions for a single query image, we score, filter, and merge them to improve results. Proposals are evaluated using multimodal scores computed at local and global levels. Extensive experiments on COCO-20i, Pascal-5i, LVIS-92i, and FSS-1000 demonstrate that integrating all four scoring components is crucial for robust ranking, validating our contribution. As MARS can be effortlessly integrated with various mask proposal systems, we deploy it across a wide range of top-performing methods and achieve new state-of-the-art results on multiple existing benchmarks. Code will be available upon acceptance.

Problem

Research questions and friction points this paper is trying to address.

Existing few-shot segmentation work lacks a mask selection method that goes beyond visual similarity
Mask proposals need a robust multimodal ranking
A plug-and-play system is needed to improve the output of existing mask proposal methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal ranking system that scores, filters, and merges mask proposals
Multimodal scores computed at both local and global levels
Plug-and-play integration with top-performing mask proposal methods
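
The score, filter, and merge steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name rank_and_merge, the plain average used to combine local and global scores, the keep_ratio and threshold parameters, and the score-weighted pixel vote are all assumptions made for illustration. The actual scoring components of MARS are not specified in this summary.

```python
import math

import numpy as np


def rank_and_merge(masks, local_scores, global_scores,
                   keep_ratio=0.5, threshold=0.5):
    """Score, filter, and merge binary mask proposals (illustrative only).

    masks: array-like of shape (N, H, W) with values in {0, 1}.
    local_scores, global_scores: array-like of shape (N,), assumed in [0, 1].
    Returns (merged_mask, kept_indices).
    """
    masks = np.asarray(masks, dtype=float)
    local_scores = np.asarray(local_scores, dtype=float)
    global_scores = np.asarray(global_scores, dtype=float)

    # Score: assumed combination, a plain average of the two levels.
    scores = 0.5 * (local_scores + global_scores)

    # Filter: keep the highest-scoring fraction of proposals (at least one).
    n_keep = max(1, math.ceil(keep_ratio * len(scores)))
    kept = np.argsort(scores)[::-1][:n_keep]

    # Merge: score-weighted vote per pixel, then binarize.
    total = scores[kept].sum()
    if total > 0:
        weights = scores[kept] / total
    else:
        weights = np.full(n_keep, 1.0 / n_keep)  # fall back to uniform weights
    fused = np.tensordot(weights, masks[kept], axes=1)
    return fused >= threshold, kept


# Three toy 2x2 proposals for one query image (made-up values).
proposals = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[0, 0], [1, 1]],
]
merged, kept = rank_and_merge(
    proposals,
    local_scores=[0.9, 0.8, 0.1],
    global_scores=[0.7, 0.6, 0.3],
)
```

In this toy case the third proposal has the lowest combined score and is filtered out, and the merged mask is the weighted vote of the remaining two. Because the ranking step only consumes proposals and scores, a sketch like this can sit behind any proposal generator, which is the plug-and-play property the summary emphasizes.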