PCA-Seg: Revisiting Cost Aggregation for Open-Vocabulary Semantic and Part Segmentation

📅 2026-03-18
🤖 AI Summary
This work addresses a limitation of existing open-vocabulary semantic and part segmentation methods: they aggregate spatial and class information sequentially, which causes interference between class-level semantics and spatial context. To overcome this, the authors propose PCA-Seg, a parallel cost aggregation framework built around an expert-driven perceptual learning (EPL) module that fuses the semantic and contextual streams in parallel. The module combines a multi-expert parser, which extracts complementary features, with a coefficient mapper that learns pixel-specific weights for fusing them. A feature orthogonalization decoupling (FOD) strategy further reduces redundancy between the two streams so that the module learns more diverse vision-language alignment cues. Each parallel block adds only 0.35M parameters, and PCA-Seg achieves state-of-the-art performance on eight open-vocabulary segmentation benchmarks.
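To make the redundancy-reduction idea concrete, the sketch below removes from a contextual feature vector its component along a semantic feature vector, so the two become orthogonal. This is only an illustration under stated assumptions: the summary does not give the paper's formulation, so the Gram-Schmidt-style projection and the names `orthogonalize`, `semantic`, and `contextual` are hypothetical stand-ins. The actual method may instead use, for example, an orthogonality loss. The code handles a single pixel's feature vectors; a real model would apply the same operation across the whole feature map.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v, eps=1e-8):
    """Cosine similarity, used here as a simple redundancy measure."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + eps)


def orthogonalize(semantic, contextual, eps=1e-8):
    """Subtract from `contextual` its projection onto `semantic`.

    Assumption: projection removal is an illustrative choice, not the
    paper's confirmed formulation of feature orthogonalization.
    """
    scale = dot(contextual, semantic) / (dot(semantic, semantic) + eps)
    return [c - scale * s for c, s in zip(contextual, semantic)]


# Toy feature vectors for one pixel (made-up values).
semantic = [1.0, 2.0, 0.0, 1.0]
contextual = [2.0, 1.0, 1.0, 0.0]

contextual_perp = orthogonalize(semantic, contextual)

print("similarity before:", cosine(semantic, contextual))
print("similarity after: ", cosine(semantic, contextual_perp))
```

After the projection is removed, the similarity between the two streams drops to approximately zero, which is the sense in which the orthogonalized features carry non-overlapping information.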

📝 Abstract
Recent advances in vision-language models (VLMs) have garnered substantial attention in open-vocabulary semantic and part segmentation (OSPS). However, existing methods extract image-text alignment cues from cost volumes through a serial structure of spatial and class aggregations, leading to knowledge interference between class-level semantics and spatial context. Therefore, this paper proposes a simple yet effective parallel cost aggregation (PCA-Seg) paradigm to alleviate the above challenge, enabling the model to capture richer vision-language alignment information from cost volumes. Specifically, we design an expert-driven perceptual learning (EPL) module that efficiently integrates semantic and contextual streams. It incorporates a multi-expert parser to extract complementary features from multiple perspectives. In addition, a coefficient mapper is designed to adaptively learn pixel-specific weights for each feature, enabling the integration of complementary knowledge into a unified and robust feature embedding. Furthermore, we propose a feature orthogonalization decoupling (FOD) strategy to mitigate redundancy between the semantic and contextual streams, which allows the EPL module to learn diverse knowledge from orthogonalized features. Extensive experiments on eight benchmarks show that each parallel block in PCA-Seg adds merely 0.35M parameters while achieving state-of-the-art OSPS performance.
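The coefficient-mapper step described in the abstract can be pictured as a per-pixel weighted blend of the expert outputs. The sketch below is a minimal illustration, not the paper's implementation. It assumes that the weights are produced by a softmax over raw per-pixel scores, and the names `fuse_experts` and `coefficient_logits` are hypothetical. The abstract states only that pixel-specific weights are learned adaptively, so the normalization scheme is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(expert_features, coefficient_logits):
    """Blend K expert feature vectors for a single pixel.

    expert_features: K vectors of equal length, one per expert.
    coefficient_logits: K raw scores for this pixel (assumed to come
        from a learned coefficient mapper; hypothetical here).
    Returns the fused vector and the normalized weights.
    """
    weights = softmax(coefficient_logits)
    dim = len(expert_features[0])
    fused = [
        sum(w * feat[d] for w, feat in zip(weights, expert_features))
        for d in range(dim)
    ]
    return fused, weights


# Three toy experts, each producing a 2-dimensional feature for one pixel.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give equal weights, so the result is the plain average.
fused, weights = fuse_experts(experts, [0.0, 0.0, 0.0])
print(weights, fused)
```

Because the weights are computed per pixel, different image regions can favor different experts. That is the property the abstract attributes to the coefficient mapper.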

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary segmentation
cost aggregation
semantic-context interference
vision-language alignment
part segmentation

Innovation

Methods, ideas, or system contributions that make the work stand out.

parallel cost aggregation
expert-driven perceptual learning
feature orthogonalization decoupling
open-vocabulary segmentation
vision-language alignment

Authors

Jianjian Yin, Nanjing University of Science and Technology
Tao Chen, Nanjing University of Science and Technology (computer vision)
Yi Chen, Nanjing Normal University (machine learning, computer vision)
Gensheng Pei, Department of Electrical and Computer Engineering, Sungkyunkwan University
Xiangbo Shu, Nanjing University of Science and Technology
Yazhou Yao, Nanjing University of Science and Technology
Fumin Shen, University of Electronic Science and Technology of China