Novel Category Discovery with X-Agent Attention for Open-Vocabulary Semantic Segmentation

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Open-vocabulary semantic segmentation (OVSS) suffers from poor discriminative modeling of unseen categories, primarily because of the domain shift between base-category training and open-world inference and because the mechanisms of latent semantic understanding are poorly characterized. To address this, the authors propose X-Agent, a framework that introduces a latent-semantic-aware “agent” mechanism. It uses agent-guided cross-attention to dynamically model cross-modal semantic alignment, making the implicit semantics within vision-language models (VLMs) more perceptible and more generalizable. Built upon pre-trained VLMs, X-Agent integrates inductive latent semantic analysis and jointly optimizes multimodal representations for both consistency and discriminability. Evaluated on multiple benchmarks, X-Agent achieves state-of-the-art performance, improves latent semantic saliency, and shows stronger robustness and generalization, particularly in unseen category discovery and pixel-level segmentation.

📝 Abstract
Open-vocabulary semantic segmentation (OVSS) conducts pixel-level classification via text-driven alignment, where the domain discrepancy between base-category training and open-vocabulary inference makes it difficult to model latent unseen categories discriminatively. To address this challenge, existing vision-language model (VLM)-based approaches achieve commendable performance through pre-trained multi-modal representations. However, the fundamental mechanisms of latent semantic comprehension remain underexplored, which makes them a bottleneck for OVSS. In this work, we conduct a probing experiment to explore the distribution patterns and dynamics of latent semantics in VLMs under inductive learning paradigms. Building on these insights, we propose X-Agent, an innovative OVSS framework that employs a latent semantic-aware “agent” to orchestrate cross-modal attention mechanisms, simultaneously optimizing latent semantic dynamics and amplifying their perceptibility. Extensive benchmark evaluations demonstrate that X-Agent achieves state-of-the-art performance while effectively enhancing latent semantic saliency.
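
The phrase "pixel-level classification via text-driven alignment" can be made concrete with a small sketch. The code below is a generic illustration, not X-Agent's pipeline: it assumes per-pixel embeddings and per-category text embeddings already live in a shared space, and it labels each pixel with the category whose text embedding has the highest cosine similarity. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def classify_pixels(pixel_embeds, text_embeds):
    """Return, for each pixel, the index of the most similar text embedding.

    Similarity is the cosine between the pixel embedding and each category's
    text embedding; both are assumed to share one embedding space.
    """
    texts = [l2_normalize(t) for t in text_embeds]
    labels = []
    for pixel in pixel_embeds:
        p = l2_normalize(pixel)
        sims = [sum(a * b for a, b in zip(p, t)) for t in texts]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels


# Toy data (hypothetical): three category prompts and three pixels.
text_embeds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pixel_embeds = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]
labels = classify_pixels(pixel_embeds, text_embeds)
print(labels)  # [0, 1, 2]
```

Because the category set is just a list of text embeddings, unseen categories can be added at inference time by appending rows to `text_embeds`. The discriminability of those new rows is the part the abstract identifies as difficult.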
Problem

Research questions and friction points this paper is trying to address.

Addresses domain discrepancy in open-vocabulary semantic segmentation
Explores latent semantic comprehension in vision-language models
Enhances discriminative modeling of unseen categories
Innovation

Methods, ideas, or system contributions that make the work stand out.

X-Agent framework with agent-based attention
Latent semantic-aware cross-modal orchestration mechanism
Optimized dynamic semantic perception for segmentation
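
The summary does not spell out how the agent-based attention is computed, so the following is only a sketch of one generic pattern that matches the description: a small set of intermediary "agent" tokens first gathers information from the text tokens, and the pixel tokens then read from those agents. Everything here is an assumption made for illustration, including the function names, the two-stage design, the absence of learned projections, and the toy inputs. It should not be read as the paper's actual formulation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)


def agent_cross_attention(pixel_tokens, text_tokens, agent_tokens):
    """Route text information to pixels through intermediary agent tokens."""
    # Stage 1 (assumed): each agent summarizes the text tokens.
    agent_summary = attend(agent_tokens, text_tokens, text_tokens)
    # Stage 2 (assumed): each pixel reads from the agents' summaries.
    return attend(pixel_tokens, agent_tokens, agent_summary)


# Toy inputs (hypothetical): 3 pixels, 2 text tokens, 2 agents, dimension 2.
pixel_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
agent_tokens = [[2.0, 0.0], [0.0, 2.0]]
out = agent_cross_attention(pixel_tokens, text_tokens, agent_tokens)
```

In this sketch each output row is a convex combination of the text tokens, and the pixel-to-text interaction passes through only as many intermediaries as there are agents. In a trained model the agents and the query, key, and value projections would be learned.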