🤖 AI Summary
Current image captioning models lack fine-grained visual grounding, which hinders verifying text–visual alignment, tracking object identity across sentences, and localizing actions together with the objects they involve. To address this, we introduce the first fine-grained grounded captioning dataset for cinematic scenes, comprising 77 films and 52,016 frames, where each caption is annotated with detected object IDs (132 classes), action labels (51 classes), and consistent cross-sentence object identity tracking. We propose an ID-driven persistent object tracking mechanism, a background segmentation strategy, and gMETEOR, a novel evaluation metric that jointly measures linguistic quality and grounding accuracy. Our end-to-end grounded captioning framework builds on a fine-tuned Pixtral-12B and a custom label-aware annotation system. Experiments demonstrate substantial improvements in caption verifiability and referential consistency, establishing a strong gMETEOR baseline and advancing interpretable, verifiable vision–language understanding.
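The summary describes captions whose object references carry persistent IDs and whose action annotations link back to those IDs. As a minimal sketch of how such references can be machine-checked, the snippet below parses a tagged caption and verifies that every action refers to an object ID introduced in the caption. The tag names (`o`, `a`), attribute names (`id`, `obj`), and the `check_reference_consistency` helper are illustrative assumptions, not the dataset's actual markup.

```python
import re

# Hypothetical tag format for illustration only, e.g.
#   <o id="person-1">the man</o>
#   <a id="act-1" obj="person-1">runs</a>
# The dataset's real tag syntax may differ.
TAG = re.compile(r'<(o|a) id="([^"]+)"(?: obj="([^"]+)")?>([^<]*)</\1>')

def check_reference_consistency(caption: str) -> bool:
    """Return True if every action tag links to an object ID
    that is introduced somewhere in the caption."""
    object_ids: set[str] = set()
    action_links: list[str] = []
    for kind, tag_id, obj_ref, _text in TAG.findall(caption):
        if kind == "o":
            object_ids.add(tag_id)
        elif obj_ref:
            action_links.append(obj_ref)
    return all(ref in object_ids for ref in action_links)

caption = ('<o id="person-1">A man</o> '
           '<a id="act-1" obj="person-1">runs</a> across the street.')
print(check_reference_consistency(caption))   # consistent reference
```

Because IDs persist across sentences, the same check extends naturally to multi-sentence captions: an object introduced in sentence one can be referenced by an action in sentence three.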
📝 Abstract
Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking, and present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded in detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and segmentation of background elements through K-means clustering. We propose gMETEOR, a metric that combines caption quality with grounding accuracy, and establish baseline performance by fine-tuning Pixtral-12B. Human evaluation demonstrates our approach's effectiveness in producing verifiable descriptions with coherent object references.
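The abstract states that gMETEOR combines caption quality (METEOR) with grounding accuracy but does not give the formula here. A minimal sketch, assuming a harmonic-mean combination (the paper's exact formulation may weight the terms differently):

```python
def gmeteor(meteor_score: float, grounding_f1: float) -> float:
    """Hypothetical gMETEOR: harmonic mean of the caption's METEOR
    score and its grounding F1 (precision/recall over grounded
    object and action references). Both inputs are in [0, 1].
    A harmonic mean rewards captions that are BOTH fluent and
    well-grounded; either component near zero drags the score down.
    """
    if meteor_score + grounding_f1 == 0:
        return 0.0
    return 2 * meteor_score * grounding_f1 / (meteor_score + grounding_f1)

# A fluent but ungrounded caption scores poorly:
print(gmeteor(0.9, 0.0))  # 0.0
# Balanced quality and grounding:
print(gmeteor(0.5, 0.5))  # 0.5
```

The design intuition is the same as F1 itself: a single averaged number would let a model trade grounding for fluency, while a harmonic-style combination penalizes neglecting either axis.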