ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation

📅 2025-06-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses open-vocabulary semantic segmentation (OVS) without model fine-tuning and identifies reference dataset quality, rather than architectural limitations, as the primary bottleneck. It presents a systematic analysis of how the quality of reference image-text pairs affects training-free OVS performance. The purely data-driven framework constructs a high-quality reference set by aligning CLIP embeddings and applying hierarchical semantic filtering and pairwise optimization, then segments images through lightweight cosine-similarity retrieval. It avoids complex attention mechanisms and synthetic data generation, relying solely on improved reference set quality for its performance gains. Evaluated on ten benchmark datasets, the method outperforms all existing training-free OVS approaches. The results support the “high-quality data as strong prior” paradigm for open-vocabulary dense prediction.

📝 Abstract

Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of the models they rely on or by the suboptimal quality of their reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training. Our code is available at https://github.com/xiweix/ReME.
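
The "simple similarity-based retrieval" mentioned in the abstract can be illustrated with a toy nearest-neighbour lookup. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names (`cosine`, `retrieve_label`), the three-dimensional vectors, and the labels are all hypothetical stand-ins, and a real pipeline would use embeddings produced by a pre-trained encoder over a much larger reference set.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_label(query, reference, k=1):
    """Return the majority label among the k reference entries most similar to `query`.

    `reference` is a list of (embedding, label) pairs. This layout is an
    assumption made for illustration, not the format used by the paper.
    """
    ranked = sorted(reference, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical reference set: toy segment embeddings paired with text labels.
reference = [
    ([1.0, 0.0, 0.0], "cat"),
    ([0.9, 0.1, 0.0], "cat"),
    ([0.0, 1.0, 0.0], "dog"),
    ([0.0, 0.0, 1.0], "sky"),
]

# Hypothetical embeddings for three segments of a test image.
query_segments = [[0.8, 0.2, 0.0], [0.1, 0.9, 0.1], [0.0, 0.1, 1.0]]

labels = [retrieve_label(q, reference, k=1) for q in query_segments]
print(labels)  # ['cat', 'dog', 'sky']
```

Because the lookup itself has no learned parameters, the only lever on accuracy in this sketch is which (embedding, label) pairs the reference set contains, which is the point the abstract makes about data quality.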

Problem

Research questions and friction points this paper is trying to address.

Training-free open-vocabulary segmentation without costly fine-tuning

Improving data quality for dense scene understanding tasks

Enhancing segmentation via high-quality reference sets and embeddings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-quality-oriented framework for OVS

Constructs well-paired segment-text embeddings

Simple similarity-based retrieval method

🔎 Similar Papers

Auto-Vocabulary Semantic Segmentation