🤖 AI Summary
Current image captioning models lack fine-grained visual grounding, which hinders verifying text–visual alignment, tracking object identity across sentences, and localizing actions together with the objects they involve. To address this, we introduce the first fine-grained grounded captioning dataset for cinematic scenes, comprising 77 films and 52,016 frames, where each caption is annotated with detected object IDs (132 classes), action labels (51 classes), and consistent cross-sentence object identity tracking. We propose an ID-driven persistent object tracking mechanism, a background segmentation strategy, and gMETEOR, a novel evaluation metric that jointly measures linguistic quality and grounding accuracy. Our end-to-end grounded captioning framework builds on a fine-tuned Pixtral-12B and a custom label-aware annotation system. Experiments demonstrate substantial improvements in caption verifiability and referential consistency, establishing a strong gMETEOR baseline and advancing interpretable, verifiable vision–language understanding.
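The summary describes captions whose object references carry persistent IDs and whose action annotations link back to those IDs. As a minimal sketch of how such references can be machine-checked, the snippet below parses a tagged caption and verifies that every action refers to an object ID introduced in the caption. The tag names (`o`, `a`), attribute names (`id`, `obj`), and the `check_reference_consistency` helper are illustrative assumptions, not the dataset's actual markup.

```python
import re

# Hypothetical tag format for illustration only, e.g.
#   <o id="person-1">the man</o>
#   <a id="act-1" obj="person-1">runs</a>
# The dataset's real tag syntax may differ.
TAG = re.compile(r'<(o|a) id="([^"]+)"(?: obj="([^"]+)")?>([^<]*)</\1>')

def check_reference_consistency(caption: str) -> bool:
    """Return True if every action tag links to an object ID
    that is introduced somewhere in the caption."""
    object_ids: set[str] = set()
    action_links: list[str] = []
    for kind, tag_id, obj_ref, _text in TAG.findall(caption):
        if kind == "o":
            object_ids.add(tag_id)
        elif obj_ref:
            action_links.append(obj_ref)
    return all(ref in object_ids for ref in action_links)

caption = ('<o id="person-1">A man</o> '
           '<a id="act-1" obj="person-1">runs</a> across the street.')
print(check_reference_consistency(caption))   # consistent reference
```

Because IDs persist across sentences, the same check extends naturally to multi-sentence captions: an object introduced in sentence one can be referenced by an action in sentence three.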
📝 Abstract
Current image captioning systems lack the ability to link descriptive text to specific visual elements, making their outputs difficult to verify. While recent approaches offer some grounding capabilities, they cannot track object identities across multiple references or ground both actions and objects simultaneously. We propose a novel ID-based grounding system that enables consistent object reference tracking and action-object linking, and present GroundCap, a dataset containing 52,016 images from 77 movies, with 344 human-annotated and 52,016 automatically generated captions. Each caption is grounded in detected objects (132 classes) and actions (51 classes) using a tag system that maintains object identity while linking actions to the corresponding objects. Our approach features persistent object IDs for reference tracking, explicit action-object linking, and segmentation of background elements through K-means clustering. We propose gMETEOR, a metric that combines caption quality with grounding accuracy, and establish baseline performance by fine-tuning Pixtral-12B. Human evaluation demonstrates our approach's effectiveness in producing verifiable descriptions with coherent object references.
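The abstract states that gMETEOR combines caption quality (METEOR) with grounding accuracy but does not give the formula here. A minimal sketch, assuming a harmonic-mean combination (the paper's exact formulation may weight the terms differently):

```python
def gmeteor(meteor_score: float, grounding_f1: float) -> float:
    """Hypothetical gMETEOR: harmonic mean of the caption's METEOR
    score and its grounding F1 (precision/recall over grounded
    object and action references). Both inputs are in [0, 1].
    A harmonic mean rewards captions that are BOTH fluent and
    well-grounded; either component near zero drags the score down.
    """
    if meteor_score + grounding_f1 == 0:
        return 0.0
    return 2 * meteor_score * grounding_f1 / (meteor_score + grounding_f1)

# A fluent but ungrounded caption scores poorly:
print(gmeteor(0.9, 0.0))  # 0.0
# Balanced quality and grounding:
print(gmeteor(0.5, 0.5))  # 0.5
```

The design intuition is the same as F1 itself: a single averaged number would let a model trade grounding for fluency, while a harmonic-style combination penalizes neglecting either axis.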