No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

📅 2024-09-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Image captioning models produce coarse-grained and unfaithful outputs because their training data is either noisy (e.g., web-sourced alt-text) or generic, and maximum likelihood training biases generation toward frequently occurring phrases; self-retrieval (SR) fine-tuning, a common remedy, can further reduce faithfulness and induce hallucination. To address this, the paper proposes Visual Caption Boosting (VCB), a framework that instills fine-grained detail into generic captioning datasets while remaining anchored in human annotations, and BagCurri, a training curriculum that better exploits the contrastive nature of the SR reward. It also introduces TrueMatch, a benchmark of bags of highly similar images that uses SR to assess a captioner's ability to capture subtle visual distinctions. Experiments show gains of +8.9% on SR against 99 random distractors (RD100) and +7.6% on ImageCoDe, and the approach outperforms Cambrian by 4.8-7.1% on TrueMatch while using one to two orders of magnitude fewer parameters.
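The linchpin of both the reward and the benchmark is self-retrieval: a caption is scored by whether it retrieves its own image from a bag of candidates. Below is a minimal sketch of such a contrastive SR reward, assuming CLIP-style caption and image embeddings; the function name, temperature `tau`, and log-probability reward form are illustrative assumptions, not the paper's code.

```python
# Hypothetical contrastive self-retrieval (SR) reward: the generated caption
# should pick out its own image from a bag of distractor images.
import torch
import torch.nn.functional as F

def sr_reward(caption_emb: torch.Tensor,   # (d,) embedding of generated caption
              image_embs: torch.Tensor,    # (B, d) embeddings of the image bag
              target_idx: int,             # index of the caption's true image
              tau: float = 0.07) -> torch.Tensor:
    caption_emb = F.normalize(caption_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    sims = image_embs @ caption_emb / tau      # scaled cosine similarities
    log_probs = F.log_softmax(sims, dim=0)     # retrieval distribution over the bag
    return log_probs[target_idx]               # high when the right image wins
```

Such a reward is only informative on hard bags of visually similar images, where winning retrieval requires fine-grained detail; this is the property both BagCurri and TrueMatch exploit.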

📝 Abstract
Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that more optimally leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023); and +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval through the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g., +4.8-7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
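The abstract frames SR fine-tuning as reward-based post-training on top of an MLE-initialized captioner. A common way to realize such a setup is a policy-gradient (REINFORCE) step driven by the SR reward; the sketch below assumes that formulation and a hypothetical `captioner.sample` / `retrieval_log_prob` interface, and is not the authors' published implementation.

```python
# Hedged sketch of SR fine-tuning as REINFORCE with a mean baseline.
# All interfaces here are assumed for illustration.
import torch

def sr_finetune_step(captioner, retrieval_log_prob, images, bag, optimizer):
    """One step: sample captions, reward correct retrieval from the bag,
    and increase the likelihood of high-reward captions."""
    captions, log_probs = captioner.sample(images)   # (N,) summed token log-probs
    with torch.no_grad():
        rewards = retrieval_log_prob(captions, bag)  # (N,) SR reward per caption
        baseline = rewards.mean()                    # variance-reduction baseline
    loss = -((rewards - baseline) * log_probs).mean()  # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```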
Problem

Research questions and friction points this paper is trying to address.

Improving fine-grained image captioning accuracy and faithfulness
Designing curriculum for self-retrieval reward fine-tuning
Evaluating caption diversity and fine-grained understanding ability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Caption Boosting enhances fine-grained captioning
BagCurri optimizes self-retrieval reward training via a bag curriculum (see the sketch after this list)
TrueMatch benchmark evaluates fine-grained visual distinctions
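As referenced above, one plausible reading of a bag curriculum is a schedule that grows the retrieval bag over training, so the contrastive SR reward starts easy and progressively hardens. The linear schedule and parameter names below are assumptions for illustration; the paper's actual curriculum may differ.

```python
# Hypothetical bag-size curriculum: few distractors early, many late.
def bag_size_schedule(step: int, total_steps: int,
                      min_bag: int = 2, max_bag: int = 100) -> int:
    """Linearly grow the number of images per retrieval bag over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(min_bag + frac * (max_bag - min_bag))

# Example: the bag grows from 2 images to 100 (1 target + 99 distractors),
# matching the RD100 evaluation setting of 99 random distractors.
for step in (0, 2500, 5000, 10_000):
    print(step, bag_size_schedule(step, total_steps=10_000))
```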
Manu Gaur
CVIT, IIIT Hyderabad, India
Darshan Singh
CVIT, IIIT Hyderabad, India
Makarand Tapaswi
IIIT Hyderabad, Wadhwani AI
AI for Social Good · Story Understanding · Vision and Language · Computer Vision · NLP