🤖 AI Summary
This work addresses the reliance on time-consuming training or model-specific attention mechanisms for logits optimization in open-vocabulary semantic segmentation by proposing a training-free approach that eliminates logits refinement altogether. The method leverages cosine similarity between vision-language features and introduces a consistency assumption, namely that distributional discrepancies encode semantic information, enabling the direct derivation of an analytical solution that yields semantic segmentation maps. According to the authors, this is the first study to apply an analytical solution based on distributional discrepancy directly to open-vocabulary segmentation, thereby discarding conventional iterative optimization pipelines. Experiments demonstrate that the proposed method achieves state-of-the-art performance across eight benchmark datasets, significantly improves computational efficiency, and removes dependencies on both training procedures and customized model components.
📝 Abstract
Open-vocabulary semantic segmentation (OVSS) aims to segment regions of arbitrary categories in images from open-vocabulary prompts, which requires pixel-level vision-language alignment. Typically, this alignment is obtained by computing the cosine similarity (i.e., logits) between visual and linguistic features and minimizing the distribution discrepancy between the logits and the ground truth (GT) to produce optimal logits, from which segmentation maps are then constructed; however, this process depends on time-consuming iterative training or model-specific attention modulation. In this work, we propose a more direct approach that eschews logits optimization by directly deriving an analytic solution for the segmentation map. We posit a key hypothesis: the distribution discrepancy encodes semantic information; specifically, the discrepancy is consistent across patches of the same category but inconsistent across different categories. Based on this hypothesis, we use the analytic solution of the distribution discrepancy directly as the semantic map. In other words, we reformulate the optimization of the distribution discrepancy as the derivation of its analytic solution, thereby eliminating time-consuming iterative training, freeing us from model-specific attention modulation, and achieving state-of-the-art performance on eight benchmark datasets.
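The cosine-similarity logits step that the abstract describes can be sketched as follows. This is a minimal illustration only: the array shapes, the random features standing in for a CLIP-style encoder's outputs, and the final argmax assignment are all assumptions for demonstration; it shows the conventional logits computation, not the paper's analytic solution.

```python
import numpy as np

# Hypothetical setup: N image patches with D-dim visual features, and
# C open-vocabulary category prompts with D-dim text features (in practice
# these would come from a CLIP-like vision-language encoder).
rng = np.random.default_rng(0)
N, C, D = 6, 3, 8
patch_feats = rng.standard_normal((N, D))   # visual feature per patch
text_feats = rng.standard_normal((C, D))    # text feature per category prompt

# L2-normalize both sides so the dot product equals cosine similarity.
patch_feats /= np.linalg.norm(patch_feats, axis=-1, keepdims=True)
text_feats /= np.linalg.norm(text_feats, axis=-1, keepdims=True)

# Logits: cosine similarity between every patch and every category prompt.
logits = patch_feats @ text_feats.T         # shape (N, C), values in [-1, 1]

# A naive segmentation map assigns each patch its highest-scoring category.
# Existing methods instead refine these logits via training or attention
# modulation; the proposed method replaces that with an analytic solution.
seg_map = logits.argmax(axis=-1)            # shape (N,)
```

Normalizing before the matrix product is the standard way to get all patch-prompt cosine similarities in a single operation rather than looping over pairs.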