BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs

📅 2024-07-03
📈 Citations: 1
Influential: 0
🤖 AI Summary
Current vision-language models generate verbose, entangled image captions with semantically mixed content, hindering effective parsing by downstream foundation models (e.g., GroundingDINO, SDXL). Method: We propose BACON, a prompt engineering framework introducing the "Bag-of-Concept" paradigm: a structured caption parsing approach that disentangles raw captions into interpretable, standardized units (e.g., objects, relationships, style, theme) and outputs them in JSON format, requiring no fine-tuning of the downstream models. The end-to-end pipeline integrates GPT-4V for automatic annotation, LLaVA fine-tuning, concept-graph modeling, and structured prompting. Contribution/Results: Trained on 100K image-caption pairs, BACON-LLaVA surpasses state-of-the-art captioners, and BACON-style captions improve GroundingDINO's open-vocabulary detection recall by 1.51x over leading methods. Comprehensive evaluations, including user studies and automated metrics, demonstrate significant gains in caption quality, grounding accuracy, and interpretability.

📝 Abstract
Advancements in large Vision-Language Models have brought precise, accurate image captioning, vital for advancing multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined contexts that are difficult to parse and frequently overlook essential cues, posing a great barrier for models like GroundingDINO and SDXL, which lack the strong text encoding and syntax analysis needed to fully leverage dense captions. To address this, we propose BACON, a prompting method that breaks down VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion from handling complex contexts but also allows for efficient transfer into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information. We annotated 100,000 image-caption pairs using BACON with GPT-4V and trained an LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V. Evaluations of overall quality, precision, and recall, as well as user studies, demonstrate that the resulting caption model consistently outperforms other SOTA VLMs in generating high-quality captions. In addition, we show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing SOTA solutions without training. For example, BACON-style captions help GroundingDINO achieve 1.51x higher recall scores on open-vocabulary object detection tasks compared to leading methods.
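The core idea of a JSON-dictionary caption can be illustrated with a minimal sketch. The field names below (`theme`, `style`, `objects`, `relationships`) are assumptions based on the element categories named in the abstract, not the paper's exact schema:

```python
import json

# A hypothetical BACON-style structured caption. Field names are
# illustrative assumptions, not the paper's published schema.
bacon_caption = json.dumps({
    "theme": "a quiet morning street scene",
    "style": "photorealistic, soft lighting",
    "objects": [
        {"name": "bicycle", "attributes": ["red", "parked"]},
        {"name": "cafe table", "attributes": ["wooden"]},
    ],
    "relationships": [
        {"subject": "bicycle", "relation": "next to", "object": "cafe table"},
    ],
})

def object_phrases(caption_json: str) -> list[str]:
    """Pull flat object names out of a structured caption, so a
    text-conditioned detector (e.g., GroundingDINO) can consume them
    directly instead of parsing free-form prose."""
    caption = json.loads(caption_json)
    return [obj["name"] for obj in caption.get("objects", [])]

print(object_phrases(bacon_caption))  # ['bicycle', 'cafe table']
```

Because each concept lives in its own key, a model with weak text encoding only ever sees short, unambiguous phrases rather than a dense paragraph.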
Problem

Research questions and friction points this paper is trying to address.

Improves clarity of lengthy, complex image captions
Enables models to access key structured caption elements
Enhances performance in object detection and captioning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

BACON disentangles captions into structured elements
Converts captions into JSON for non-linguistic models
Enables better clarity and performance in VLMs