Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation

📅 2025-01-16
🤖 AI Summary
To address two challenges in open-vocabulary part segmentation (OVPS), namely aligning part-level image-text correspondence and the lack of structural understanding of object parts, this paper proposes PartCATSeg. First, a disentangled, object-aware cost aggregation strategy handles object-level and part-level costs separately to improve the precision of part-level segmentation. Second, a compositional loss captures part-object relationships and compensates for limited part annotations. Third, structural guidance from DINO features improves boundary delineation and inter-part understanding. On Pascal-Part-116, ADE20K-Part-234, and PartImageNet, the method significantly outperforms state-of-the-art approaches and sets a new baseline for generalization to unseen part categories.

📝 Abstract
Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
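The idea of handling object-level and part-level costs separately, then relating parts back to their parent objects, can be illustrated with a small sketch. This is not the PartCATSeg implementation: the function names, the toy embeddings, the softmax temperature, and the sum-over-parts composition with a negative log-likelihood are all assumptions chosen for illustration, and the cost is assumed here to be a plain cosine similarity between per-pixel image features and class text embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cost_volume(pixel_feats, text_embs):
    """cost[p][c] = cosine similarity of pixel p with class text c."""
    return [[cosine(f, t) for t in text_embs] for f in pixel_feats]


def softmax(scores, temperature=0.1):
    """Temperature-scaled softmax over one pixel's class scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compose_parts_to_objects(part_probs, part_to_object, num_objects):
    """Sum each pixel's part probabilities into their parent objects."""
    object_probs = [0.0] * num_objects
    for prob, obj in zip(part_probs, part_to_object):
        object_probs[obj] += prob
    return object_probs


def compositional_nll(composed_probs, object_label):
    """Hypothetical part-object consistency term: NLL of the true object
    under the object distribution composed from part predictions."""
    return -math.log(max(composed_probs[object_label], 1e-12))


# Toy data (assumed): 2 pixels, 3-dim features, 2 objects, 3 parts.
pixel_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
object_text = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]]   # e.g. "dog", "car"
part_text = [[1.0, 0.0, 0.2], [0.9, 0.2, 0.0], [0.1, 1.0, 0.0]]
part_to_object = [0, 0, 1]  # dog head, dog leg -> dog; car wheel -> car

# Object-level and part-level costs are computed as separate volumes.
object_cost = cost_volume(pixel_feats, object_text)
part_cost = cost_volume(pixel_feats, part_text)

part_probs = [softmax(row) for row in part_cost]
composed = [compose_parts_to_objects(p, part_to_object, 2) for p in part_probs]
loss = sum(compositional_nll(c, y) for c, y in zip(composed, [0, 1]))
```

Keeping the two cost volumes separate means a part prediction can be checked against object-level evidence, which is one way to use object labels when part annotations are scarce.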
Problem

Research questions and friction points this paper is trying to address.

Open-Vocabulary Part Segmentation (OVPS)
Part-Level Image-Text Alignment
Lack of Structural Understanding of Object Parts
Innovation

Methods, ideas, or system contributions that make the work stand out.

PartCATSeg
Object-Aware Part-Level Cost Aggregation
Compositional Loss
DINO Structure Guidance
Jiho Choi
KAIST, Republic of Korea
Seonho Lee
KAIST AI, ex-ML intern @ Snap Inc.
Computer Vision, Machine Learning, Vision-Language Model, Generative AI
Min-Seob Lee
Samsung Electronics, Republic of Korea
Seungho Lee
Samsung Electronics, Republic of Korea
Hyunjung Shim
Associate Professor, KAIST
Computer vision, machine learning