HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing purely vision-based methods for semi-supervised semantic segmentation under extremely low labeling ratios (≤1%) suffer from poor generalization and inaccurate boundary localization, while vision-language models lack spatially dense modeling capabilities. To address these limitations, we propose HierVL, a novel framework featuring hierarchical text-spatial dynamic alignment. It comprises a multi-scale semantic query generator, a cross-modal spatial alignment module, and a dual-query Transformer decoder, augmented by an alignment-preserving regularization loss. Built upon the Mask Transformer architecture, HierVL integrates CLIP-derived text embeddings, a multi-scale feature pyramid, and learnable text-pixel attention, further enhanced by contrastive and consistency regularization. Extensive experiments demonstrate state-of-the-art performance: +4.4% mIoU on COCO, +3.1% on Pascal VOC, +5.9% on ADE20K, and +1.8% on Cityscapes, establishing new benchmarks for semi-supervised semantic segmentation.
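The summary's core mechanism, projecting CLIP-derived class embeddings into per-scale semantic queries and grounding them spatially via text-pixel attention, can be sketched as follows. This is an illustrative PyTorch sketch under assumed shapes and module names (`HierarchicalQueryGenerator`, `CrossModalSpatialAlignment` are placeholders, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class HierarchicalQueryGenerator(nn.Module):
    """Sketch: project frozen text/class embeddings (e.g. from CLIP)
    into one set of semantic queries per feature-pyramid scale.
    Dimensions and scale count are assumptions."""
    def __init__(self, text_dim=512, query_dim=256, num_scales=3):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Linear(text_dim, query_dim) for _ in range(num_scales)
        )

    def forward(self, text_emb):  # text_emb: (num_classes, text_dim)
        # one query set per scale: list of (num_classes, query_dim)
        return [p(text_emb) for p in self.proj]

class CrossModalSpatialAlignment(nn.Module):
    """Sketch of learnable text-pixel attention: semantic queries
    attend over flattened pixel features to gain spatial grounding."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, pixel_feats):
        # queries: (B, num_classes, dim); pixel_feats: (B, H*W, dim)
        aligned, _ = self.attn(queries, pixel_feats, pixel_feats)
        return aligned

# Toy shapes: 2 images, 21 classes (Pascal VOC-like), 16x16 feature map.
B, C, D, H, W = 2, 21, 256, 16, 16
gen = HierarchicalQueryGenerator(text_dim=512, query_dim=D)
align = CrossModalSpatialAlignment(dim=D)

text_emb = torch.randn(C, 512)  # stand-in for CLIP class embeddings
queries = gen(text_emb)[0].unsqueeze(0).expand(B, -1, -1)  # finest scale
pixel_feats = torch.randn(B, H * W, D)
out = align(queries, pixel_feats)
print(out.shape)  # torch.Size([2, 21, 256])
```

In a Mask-Transformer-style model, the aligned queries would then be decoded into per-class mask logits; the sketch stops at the alignment step.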

📝 Abstract
Semi-supervised semantic segmentation remains challenging under severe label scarcity and domain variability. Vision-only methods often struggle to generalize, resulting in pixel misclassification between similar classes and poor boundary localization. Vision-Language Models offer robust, domain-invariant semantics but lack the spatial grounding required for dense prediction. We introduce HierVL, a unified framework that bridges this gap by integrating abstract text embeddings into a mask-transformer architecture tailored for semi-supervised segmentation. HierVL features three novel components: a Hierarchical Semantic Query Generator that filters and projects abstract class embeddings into multi-scale queries to suppress irrelevant classes and handle intra-class variability; a Cross-Modal Spatial Alignment Module that aligns semantic queries with pixel features for sharper boundaries under sparse supervision; and a Dual-Query Transformer Decoder that fuses semantic and instance-level queries to prevent instance collapse. We also introduce targeted regularization losses that maintain vision-language alignment throughout training to reinforce semantic grounding. HierVL establishes a new state of the art, improving mean Intersection-over-Union (mIoU) by +4.4% on COCO (with 232 labeled images), +3.1% on Pascal VOC (with 92 labels), +5.9% on ADE20K (with 158 labels), and +1.8% on Cityscapes (with 100 labels), demonstrating consistently better performance under 1% supervision on four benchmark datasets. Our results show that language-guided segmentation closes the label efficiency gap and unlocks new levels of fine-grained, instance-aware generalization.
Problem

Research questions and friction points this paper is trying to address.

Addresses semi-supervised segmentation under label scarcity and domain variability
Improves pixel classification and boundary localization in vision-only methods
Enhances spatial grounding for dense prediction in Vision-Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Semantic Query Generator for multi-scale embeddings
Cross-Modal Spatial Alignment Module for sharper boundaries
Dual-Query Transformer Decoder fuses semantic and instance queries
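The third bullet, fusing semantic (text-derived) and instance-level queries in one decoder, can be illustrated with a minimal PyTorch sketch. Everything here is an assumption for illustration (the query count, the concatenation-based fusion, and the Mask2Former-style dot-product mask head); the paper's actual decoder may differ:

```python
import torch
import torch.nn as nn

class DualQueryDecoder(nn.Module):
    """Hedged sketch of one dual-query decoder layer: learnable
    instance queries are concatenated with text-derived semantic
    queries, refined by cross-attention over pixel features, and
    converted to per-query mask logits via dot products."""
    def __init__(self, dim=256, num_instance_queries=100, heads=8):
        super().__init__()
        self.instance_queries = nn.Parameter(
            torch.randn(num_instance_queries, dim)
        )
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, semantic_queries, pixel_feats):
        # semantic_queries: (B, num_classes, dim)
        # pixel_feats:      (B, H*W, dim)
        B = semantic_queries.size(0)
        inst = self.instance_queries.unsqueeze(0).expand(B, -1, -1)
        # fuse: semantic queries first, instance queries after
        queries = torch.cat([semantic_queries, inst], dim=1)
        refined, _ = self.cross_attn(queries, pixel_feats, pixel_feats)
        # per-query mask logits over all pixels
        return torch.einsum('bqd,bpd->bqp', refined, pixel_feats)

# Toy shapes: 2 images, 21 classes, 16x8 feature map (128 pixels).
B, C, D, H, W = 2, 21, 256, 16, 8
dec = DualQueryDecoder(dim=D, num_instance_queries=100)
sem = torch.randn(B, C, D)
pix = torch.randn(B, H * W, D)
masks = dec(sem, pix)
print(masks.shape)  # torch.Size([2, 121, 128]) -> 21 semantic + 100 instance queries
```

Keeping both query types in one attention pass lets instance queries specialize to objects while semantic queries stay anchored to class-level text embeddings, which is the mechanism the abstract credits with preventing instance collapse.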