LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the blurry boundaries and limited accuracy of initial vision-language model (VLM) predictions in open-vocabulary semantic segmentation. The authors propose a training-free, end-to-end method: CLIP first produces coarse patch-level predictions; DINOv2 then supplies fine-grained visual features used to build a graph-based label propagation mechanism that operates jointly over patches and pixels, enabling global contextual modeling and boundary refinement at full-image resolution. To their knowledge, this is the first approach to introduce dual-granularity (patch- and pixel-level) label propagation into open-vocabulary segmentation, effectively decoupling cross-modal alignment (handled by the VLM) from intra-modal visual similarity modeling (handled by the vision encoder). The method achieves state-of-the-art performance among training-free approaches on multiple benchmarks, with notable gains in fine-grained boundary accuracy. The code is publicly available.

📝 Abstract
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: https://github.com/vladan-stojnic/LPOSS
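The abstract's core operation is classical graph-based label propagation: initial per-patch class scores are smoothed over an affinity graph built from visual (e.g. DINOv2) features, so that visually similar patches converge to the same label. Below is a minimal numpy sketch of that generic step; the function name, hyperparameters, and k-NN graph construction are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def label_propagation(z0, feats, k=8, alpha=0.9, n_iter=10):
    """Refine initial class scores z0 (n, c) by propagating them over a
    k-NN similarity graph built from visual features feats (n, d).
    Illustrative sketch: hyperparameters k, alpha, n_iter are assumptions."""
    # Cosine similarities between L2-normalized features
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-edges
    # Keep only each node's k strongest neighbors (sparse affinity)
    w = np.zeros_like(sim)
    idx = np.argsort(-sim, axis=1)[:, :k]
    rows = np.arange(sim.shape[0])[:, None]
    w[rows, idx] = np.clip(sim[rows, idx], 0.0, None)
    w = np.maximum(w, w.T)  # symmetrize the graph
    # Symmetric normalization: S = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(w.sum(axis=1), 1e-12))
    s = w * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Iterate z <- alpha * S z + (1 - alpha) * z0 toward the diffusion fixpoint
    z = z0.copy()
    for _ in range(n_iter):
        z = alpha * (s @ z) + (1 - alpha) * z0
    return z
```

The same update can in principle run at patch resolution first and then again at pixel resolution, which is how the abstract describes the boundary-refinement step of LPOSS+.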
Problem

Research questions and friction points this paper is trying to address.

Enhancing open-vocabulary segmentation via label propagation
Improving patch-based VLM predictions with intra-modal similarity
Overcoming patch-resolution limits via pixel-level refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Label propagation enhances VLM patch predictions
Pixel-level refinement improves boundary segmentation
Full-image inference captures global context