Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language segmentation models rely heavily on high-level semantic categories, which limits their ability to perform controllable segmentation from non-semantic visual attributes such as shape, geometry, or texture. To address this limitation, this work introduces Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation, a new paradigm in which models must segment from non-semantic textual descriptions alone. The authors propose two strategies for generating such prompts, one based on dictionary constraints and one on example guidance, and fine-tune segmentation models under semantic-agnostic supervision. Fine-tuning on SANSA prompts yields up to a 20% mIoU improvement over pretrained state-of-the-art models on the SANSA task while maintaining strong performance on standard semantic prompts. The results highlight the importance of low- and mid-level visual reasoning for the generalization and controllability of vision-language segmentation models.

📝 Abstract

Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about intrinsic visual properties such as shape, geometry, or texture, which are essential in many real-world applications. In this work, we introduce Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation, a new paradigm that requires segmentation models to operate solely from non-semantic textual descriptions. To this end, we propose two strategies to generate SANSA segmentation prompts based on either dictionary constraints or example guidance, both generating semantic-agnostic textual descriptions. These prompts are then used to finetune segmentation models under semantic-agnostic supervision. Experiments show that finetuning on SANSA prompts yields up to a 20% mIoU improvement on this new segmentation task, compared to pretrained state-of-the-art models, while maintaining strong performance on standard semantic prompts. These results highlight the importance of low- and mid-level visual reasoning for improving the generalization and controllability of vision-language segmentation models.
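The headline result is reported in mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch in plain Python. It assumes binary masks given as nested lists of 0/1 and averages IoU per sample; the abstract does not say which averaging convention the paper uses, and the function names `iou` and `mean_iou` are illustrative only.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1  # pixel is foreground in both masks
            if p or t:
                union += 1  # pixel is foreground in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, targets):
    """Average the per-sample IoU over a list of (prediction, target) pairs."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Toy masks: the prediction covers two pixels, the ground truth covers one.
pred_mask = [[1, 1], [0, 0]]
true_mask = [[1, 0], [0, 0]]
sample_iou = iou(pred_mask, true_mask)
```

Under this sketch, the toy prediction overlaps the ground truth on one pixel out of a two-pixel union, so its IoU is 0.5.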

Problem

Research questions and friction points this paper is trying to address.

vision-language segmentation

semantic-agnostic

shape-aware

visual reasoning

non-semantic descriptions

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic-agnostic

shape-aware

vision-language segmentation