COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image-text datasets suffer from insufficient fine-grained scene descriptions and incomplete scene coverage. To address this, the authors introduce COCONut-PanCap, the first dataset to jointly model panoptic segmentation and region-level grounded captioning. COCONut-PanCap uniquely provides pixel-accurate panoptic masks strictly aligned with human-annotated, region-specific grounded captions, enabling unified evaluation of both visual understanding (e.g., panoptic segmentation) and grounded language generation (e.g., region-grounded captioning). Methodologically, the work combines COCONut's panoptic masks, human-in-the-loop dense annotation, grounded caption modeling, and a multimodal joint-training framework. Experiments demonstrate that COCONut-PanCap significantly improves vision-language models' performance on visual understanding and text-to-image generation tasks, outperforming baselines across multiple metrics. These results underscore the importance of fine-grained, scene-comprehensive, and strictly aligned grounded annotations for advancing multimodal learning.
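The strict mask-caption alignment described above can be sketched as a minimal record schema. This is an illustrative sketch only — field names (`segment_id`, `captions`, etc.) are hypothetical and not the dataset's actual file format:

```python
from dataclasses import dataclass, field

@dataclass
class GroundedCaption:
    segment_id: int   # id of the panoptic segment this caption describes
    category: str     # panoptic category, e.g. "person" or "sky"
    caption: str      # human-edited, region-level description

@dataclass
class PanCapRecord:
    image_id: int
    segment_ids: set               # segment ids present in the panoptic mask
    captions: list = field(default_factory=list)

    def is_aligned(self) -> bool:
        # Strict alignment: every grounded caption must reference a
        # segment that actually exists in the panoptic mask.
        return all(c.segment_id in self.segment_ids for c in self.captions)

record = PanCapRecord(
    image_id=42,
    segment_ids={1, 2, 3},
    captions=[
        GroundedCaption(1, "person", "a cyclist in a red jacket"),
        GroundedCaption(3, "sky", "an overcast grey sky"),
    ],
)
print(record.is_aligned())  # True: every caption points at a real segment
```

A check like `is_aligned` is the kind of invariant such a jointly annotated dataset enables: region captions can be validated against, and trained jointly with, the pixel-level masks rather than floating free as image-level text.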

📝 Abstract
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
Problem

Research questions and friction points this paper is trying to address.

Enhance panoptic segmentation and grounded image captioning
Overcome the lack of detailed, scene-comprehensive descriptions in existing image-text datasets
Support improved training of vision-language models and text-to-image generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advanced COCONut panoptic masks built on the COCO dataset
Fine-grained, region-level captions grounded in panoptic segmentation masks
Human-edited, densely annotated descriptions