OV-Stitcher: A Global Context-Aware Framework for Training-Free Open-Vocabulary Semantic Segmentation

📅 2026-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a key limitation in existing training-free open-vocabulary semantic segmentation methods, which rely on sliding-window processing of high-resolution images and consequently lack global attention, leading to fragmented features and constrained contextual reasoning. To overcome this, the authors propose OV-Stitcher, a novel framework that, without any additional training, introduces global attention into the task for the first time. Specifically, OV-Stitcher directly stitches together sub-image features at the final layer of a pretrained vision-language model encoder and reconstructs attention representations to enable global context awareness. Evaluated across eight benchmarks, the method achieves a consistent performance gain, raising the average mIoU from 48.7 to 50.7, thereby demonstrating its effectiveness and scalability.
📝 Abstract
Training-free open-vocabulary semantic segmentation (TF-OVSS) has recently attracted attention for its ability to perform dense prediction by leveraging the pretrained knowledge of large vision and vision-language models, without requiring additional training. However, due to the limited input resolution of these pretrained encoders, existing TF-OVSS methods commonly adopt a sliding-window strategy that processes cropped sub-images independently. While effective for managing high-resolution inputs, this approach prevents global attention over the full image, leading to fragmented feature representations and limited contextual reasoning. We propose OV-Stitcher, a training-free framework that addresses this limitation by stitching fragmented sub-image features directly within the final encoder block. By reconstructing attention representations from fragmented sub-image features, OV-Stitcher enables global attention within the final encoder block, producing coherent context aggregation and spatially consistent, semantically aligned segmentation maps. Extensive evaluations across eight benchmarks demonstrate that OV-Stitcher establishes a scalable and effective solution for open-vocabulary segmentation, achieving a notable improvement in mean Intersection over Union (mIoU) from 48.7 to 50.7 compared with prior training-free baselines.
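The core pipeline the abstract describes — split a high-resolution feature map into sub-windows, process them independently, stitch the fragments back onto a global grid, and then apply self-attention over all stitched tokens at the final block — can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names (`split_into_windows`, `stitch_windows`, `global_self_attention`) and the single-head attention are assumptions standing in for the paper's reconstructed final-block attention.

```python
import numpy as np

def split_into_windows(feat, win):
    # feat: (H, W, C) feature map of the full image;
    # returns non-overlapping (win, win, C) sub-windows in row-major order,
    # mimicking sliding-window crops fed to the encoder independently.
    H, W, C = feat.shape
    return [feat[i:i + win, j:j + win]
            for i in range(0, H, win)
            for j in range(0, W, win)]

def stitch_windows(windows, grid_hw, win):
    # Inverse of the split: place each window back at its position on the
    # global grid, recovering a full-resolution feature map before the
    # final encoder block.
    gh, gw = grid_hw
    C = windows[0].shape[-1]
    out = np.zeros((gh * win, gw * win, C), dtype=windows[0].dtype)
    for idx, w in enumerate(windows):
        r, c = divmod(idx, gw)
        out[r * win:(r + 1) * win, c * win:(c + 1) * win] = w
    return out

def global_self_attention(tokens, Wq, Wk, Wv):
    # Single-head self-attention over ALL stitched tokens (N, C) at once,
    # so every token attends across former window boundaries.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v
```

Usage follows the same order as the method: `stitch_windows(split_into_windows(feat, win), (H // win, W // win), win)` reconstructs `feat` exactly, and flattening the stitched map to `(H * W, C)` tokens before `global_self_attention` is what gives attention a global receptive field that per-window processing lacks.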
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary semantic segmentation
training-free
global context
sliding-window
feature fragmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
open-vocabulary semantic segmentation
global context
feature stitching
vision-language models
Seungjae Moon
Machine Intelligence Laboratory, University of Seoul, Korea
Seunghyun Oh
Machine Intelligence Laboratory, University of Seoul, Korea
Youngmin Ro
Assistant Professor, University of Seoul
deep learning · computer vision