Show or Tell? A Benchmark To Evaluate Visual and Textual Prompts in Semantic Segmentation

📅 2025-05-06
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Semantic segmentation research has lacked a systematic, fair comparison of visual and textual prompting paradigms. Method: the paper introduces Show or Tell (SoT), a unified benchmark for segmentation prompting that covers 14 datasets across 7 domains and compares both prompt types under an identical protocol. It evaluates 5 open-vocabulary methods (textual prompts) and 4 visual reference prompt methods, adapting the latter to multi-class segmentation through a confidence-based mask-merging strategy, and pairs quantitative results with qualitative analysis. Results: textual prompts handle concepts that are easy to describe in words but generalize poorly to complex domains such as tools, while visual prompts reach stronger average performance yet vary considerably with the choice of input prompt. The benchmark and accompanying analysis yield practical design guidelines for prompting vision foundation models in semantic segmentation.

📝 Abstract
Prompt engineering has shown remarkable success with large language models, yet its systematic exploration in computer vision remains limited. In semantic segmentation, both textual and visual prompts offer distinct advantages: textual prompts through open-vocabulary methods allow segmentation of arbitrary categories, while visual reference prompts provide intuitive reference examples. However, existing benchmarks evaluate these modalities in isolation, without direct comparison under identical conditions. We present Show or Tell (SoT), a novel benchmark specifically designed to evaluate both visual and textual prompts for semantic segmentation across 14 datasets spanning 7 diverse domains (common scenes, urban, food, waste, parts, tools, and land-cover). We evaluate 5 open-vocabulary methods and 4 visual reference prompt approaches, adapting the latter to handle multi-class segmentation through a confidence-based mask merging strategy. Our extensive experiments reveal that open-vocabulary methods excel with common concepts easily described by text but struggle with complex domains like tools, while visual reference prompt methods achieve good average results but exhibit high variability depending on the input prompt. Through comprehensive quantitative and qualitative analysis, we identify the strengths and weaknesses of both prompting modalities, providing valuable insights to guide future research in vision foundation models for segmentation tasks.
Problem

Research questions and friction points this paper is trying to address.

Evaluating visual and textual prompts in semantic segmentation
Comparing open-vocabulary and visual reference prompt methods
Assessing strengths and weaknesses across diverse domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Show or Tell benchmark for segmentation prompts
Compares textual and visual prompts across 14 datasets
Uses confidence-based merging for multi-class segmentation (sketched below)
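
The merging step can be pictured concretely. Visual reference prompt methods typically segment one class at a time, so per-class predictions must be fused into a single multi-class map. Below is a minimal sketch of one plausible confidence-based merge, assuming each method emits a per-pixel confidence map per class; the function name `merge_class_masks`, the 0.5 threshold, and the background-as-zero convention are illustrative assumptions, not the authors' published implementation.

```python
import numpy as np

def merge_class_masks(confidence_maps, threshold=0.5):
    """Fuse per-class confidence maps into a single multi-class label map.

    confidence_maps: dict mapping class id (>= 1) to an (H, W) float array
        of per-pixel confidences from a single-class visual-reference model.
    Returns an (H, W) int array; 0 marks background pixels whose best
    confidence falls below `threshold`.
    """
    class_ids = np.array(sorted(confidence_maps))
    stack = np.stack([confidence_maps[c] for c in class_ids])  # (C, H, W)
    winner = stack.argmax(axis=0)       # most confident class index per pixel
    best_conf = stack.max(axis=0)       # that class's confidence per pixel
    labels = class_ids[winner]          # map indices back to class ids
    labels[best_conf < threshold] = 0   # suppress low-confidence pixels
    return labels

# Example: two classes with overlapping predictions on a 2x2 image.
maps = {
    1: np.array([[0.9, 0.2], [0.4, 0.1]]),
    2: np.array([[0.3, 0.8], [0.6, 0.2]]),
}
print(merge_class_masks(maps))
# [[1 2]
#  [2 0]]
```

Taking the per-pixel argmax over confidences resolves overlaps between class masks, while the threshold keeps pixels unclaimed by any confident prediction as background.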