Panoptic Captioning: Seeking An Equivalency Bridge for Image and Text

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces panoptic captioning—a novel vision-language task requiring comprehensive textual descriptions that account for all visual entities, their spatial locations and attributes, inter-entity relationships, and global scene semantics. To formalize this challenge, we propose the first rigorous task definition, design PancapChain—a staged, decoupled generation model—and develop PancapScore, a dedicated evaluation metric, alongside a human-annotated benchmark test set. Methodologically, our approach integrates multi-class entity detection, entity-aware prompt engineering, and PancapEngine—a high-fidelity synthetic data generation pipeline. Extensive experiments demonstrate that PancapChain-13B achieves state-of-the-art performance, significantly outperforming open-source multimodal large language models (MLLMs) such as InternVL-2.5-78B, and surpassing proprietary models including GPT-4o and Gemini-2.0-Pro on panoptic captioning.

📝 Abstract
This work introduces panoptic captioning, a novel task striving to seek the minimum text equivalence of images. We take the first step towards panoptic captioning by formulating it as a task of generating a comprehensive textual description for an image, which encapsulates all entities, their respective locations and attributes, relationships among entities, as well as global image state. Through an extensive evaluation, our work reveals that state-of-the-art Multi-modal Large Language Models (MLLMs) have limited performance in solving panoptic captioning. To address this, we propose an effective data engine named PancapEngine to produce high-quality data and a novel method named PancapChain to improve panoptic captioning. Specifically, our PancapEngine first detects diverse categories of entities in images by an elaborate detection suite, and then generates required panoptic captions using entity-aware prompts. Additionally, our PancapChain explicitly decouples the challenging panoptic captioning task into multiple stages and generates panoptic captions step by step. More importantly, we contribute a comprehensive metric named PancapScore and a human-curated test set for reliable model evaluation. Experiments show that our PancapChain-13B model can beat state-of-the-art open-source MLLMs like InternVL-2.5-78B and even surpass proprietary models like GPT-4o and Gemini-2.0-Pro, demonstrating the effectiveness of our data engine and method. Project page: https://visual-ai.github.io/pancap/
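The abstract describes PancapScore as a comprehensive metric for reliable evaluation. As a rough intuition for what such a metric must at minimum capture, here is a hedged sketch of entity-level F1 between the entities mentioned in a predicted caption and those in a reference; the actual PancapScore also accounts for locations, attributes, relationships, and global state, so this toy `entity_f1` function is illustrative only and not the paper's metric.

```python
# Illustrative sketch only: entity-coverage F1, a small slice of what a
# panoptic-caption metric like PancapScore would need to measure.
# `entity_f1` and its inputs are hypothetical, not from the paper.

def entity_f1(predicted_entities, reference_entities):
    """F1 between predicted and reference entity sets."""
    pred, ref = set(predicted_entities), set(reference_entities)
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)                 # correctly mentioned entities
    precision = tp / len(pred)           # fraction of mentions that are real
    recall = tp / len(ref)               # fraction of real entities mentioned
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = entity_f1(["dog", "ball", "tree"], ["dog", "ball", "bench"])
print(round(score, 3))  # 0.667: 2 of 3 predicted and 2 of 3 reference match
```

A full metric would additionally match each entity's box and attributes and verify stated relationships, which is why a dedicated metric and human-curated test set are needed rather than plain caption-overlap scores.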
Problem

Research questions and friction points this paper is trying to address.

Generating comprehensive textual descriptions that capture all entities, their locations and attributes, inter-entity relationships, and global image state
State-of-the-art Multi-modal Large Language Models show limited performance on panoptic captioning
Lack of a reliable metric and benchmark for evaluating panoptic captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

PancapEngine generates high-quality panoptic captioning data
PancapChain decouples task into multi-stage generation
PancapScore metric enables reliable model evaluation
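The staged, decoupled generation idea behind PancapChain can be sketched as a chain of stages that each produce one ingredient of the final caption. Everything below is a hypothetical illustration on a toy structured scene; the stage functions, data layout, and composition are assumptions for exposition, not the paper's implementation (which generates captions with an MLLM).

```python
# Hedged sketch of a staged ("chain") captioning pipeline in the spirit of
# PancapChain's decoupling. All names and the scene format are hypothetical.

def stage_entities(scene):
    # Stage 1: enumerate every visual entity in the image.
    return [e["name"] for e in scene["entities"]]

def stage_locations_attrs(scene):
    # Stage 2: attach an attribute and a location (bounding box) per entity.
    return [f'{e["attr"]} {e["name"]} at {e["box"]}' for e in scene["entities"]]

def stage_relations(scene):
    # Stage 3: describe relationships between entity pairs.
    return [f"{s} {r} {o}" for s, r, o in scene["relations"]]

def compose_caption(scene):
    # Stage 4: merge the stage outputs plus global state into one caption.
    parts = stage_locations_attrs(scene) + stage_relations(scene)
    parts.append(scene["global_state"])
    return "; ".join(parts) + "."

scene = {
    "entities": [
        {"name": "dog", "attr": "brown", "box": (10, 20, 60, 80)},
        {"name": "ball", "attr": "red", "box": (70, 75, 85, 90)},
    ],
    "relations": [("dog", "chases", "ball")],
    "global_state": "sunny park scene",
}
print(compose_caption(scene))
```

The design point the sketch mirrors is that each stage has a narrow, checkable responsibility, so errors in entity coverage, localization, or relations can be isolated instead of being entangled in one free-form generation pass.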
Kun-Yu Lin
The University of Hong Kong
Computer Vision · Machine Learning

Hongjun Wang
Visual AI Lab, The University of Hong Kong

Weining Ren
ETH Zurich
3D Vision · NeRF · SLAM

Kai Han
Visual AI Lab, The University of Hong Kong