Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation

📅 2025-01-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address two challenges in open-vocabulary part segmentation (OVPS), namely aligning part-level image-text correspondence and the lack of structural understanding of object parts, this paper proposes PartCATSeg. First, an object-aware, disentangled cost aggregation strategy processes object-level and part-level costs separately to sharpen part-level matching. Second, a compositional loss captures part-object relationships and compensates for limited part annotations. Third, structural guidance from DINO features improves boundary delineation and inter-part understanding. On Pascal-Part-116, ADE20K-Part-234, and PartImageNet, the method is reported to significantly outperform state-of-the-art approaches, setting a new baseline for generalization to unseen part categories.
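To make the idea of separate object-level and part-level costs concrete, here is a minimal sketch. It assumes that a "cost" is a cosine-similarity map between per-pixel image embeddings and text-prompt embeddings, which this summary does not state. The toy vectors, prompt names, and helper functions are hypothetical, and the learned aggregation step is omitted entirely.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cost_volume(pixel_feats, text_embeds):
    """Return cost[i][k]: similarity of pixel i to text prompt k."""
    return [[cosine(p, t) for t in text_embeds] for p in pixel_feats]

# Toy 2-D embeddings (hypothetical values, not from any real encoder).
pixel_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
object_text = {"dog": [1.0, 0.2], "car": [0.1, 1.0]}
part_text = {
    "dog's head": [1.0, 0.0],
    "dog's leg": [0.7, 0.7],
    "car's wheel": [0.0, 1.0],
}

# Keep the two cost volumes separate so each could be refined on its own
# (assumption: this mirrors the "disentangled" handling described above).
object_cost = cost_volume(pixel_feats, list(object_text.values()))
part_cost = cost_volume(pixel_feats, list(part_text.values()))

# Without any aggregation, the naive prediction is the per-pixel argmax.
part_labels = [max(range(len(row)), key=row.__getitem__) for row in part_cost]
```

In a real system the raw costs would be refined by a learned aggregation module before taking the argmax; the sketch only shows what the two inputs to such a module might look like.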

📝 Abstract

Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
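The abstract does not give the form of the compositional loss. As one plausible illustration of a part-object constraint, the sketch below penalizes disagreement between an object mask and the summed probability of that object's parts, which needs only object-level labels. The function names, the background-at-index-0 layout, and the binary cross-entropy form are all assumptions, not the paper's formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def compositional_loss(part_logits, object_mask, part_ids, eps=1e-7):
    """Binary cross-entropy between an object mask and the union of its parts.

    part_logits: per-pixel logits over all part classes (index 0 = background).
    object_mask: per-pixel 0/1 labels for one object category.
    part_ids:    indices of the part classes that belong to that object.
    """
    total = 0.0
    for logits, target in zip(part_logits, object_mask):
        probs = softmax(logits)
        # Probability that the pixel belongs to any part of the object.
        p_obj = sum(probs[k] for k in part_ids)
        p_obj = min(max(p_obj, eps), 1.0 - eps)
        total -= target * math.log(p_obj) + (1 - target) * math.log(1.0 - p_obj)
    return total / len(object_mask)

# Two pixels: the first lies on the object, the second on background.
object_mask = [1, 0]
consistent = [[0.0, 5.0, 0.0], [5.0, 0.0, 0.0]]    # parts fire on the object
inconsistent = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # parts fire off the object

loss_good = compositional_loss(consistent, object_mask, part_ids=[1, 2])
loss_bad = compositional_loss(inconsistent, object_mask, part_ids=[1, 2])
```

Under this assumed form, part predictions that together cover the object score a lower loss than predictions that fall outside it, which is the kind of supervision signal that could stand in for missing part annotations.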

Problem

Research questions and friction points this paper is trying to address.

Open Vocabulary Part Segmentation

Part-Level Image-Text Alignment

Lack of Structural Understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

PartCATSeg

Object-Aware Part-Level Cost Aggregation

Compositional Loss

DINO Structure Guidance
