Altogether: Image Captioning via Re-aligning Alt-text

📅 2024-10-22
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing image captioning methods suffer from two limitations: (1) they disregard the original alt-text when regenerating captions, and (2) they rely on opaque large language models (e.g., GPT) to generate synthetic data, compromising interpretability and transparency. To address these issues, the paper proposes an *alt-text re-alignment editing* paradigm that replaces single-pass, from-scratch captioning with an iterative, human-in-the-loop process in which annotators align the text to the image content over multiple rounds. This yields a high-quality caption dataset rich in grounded visual concepts. Crucially, the approach introduces the first synthetic data generation mechanism explicitly conditioned on authentic alt-text, substantially improving data transparency and model interpretability. Applied to end-to-end captioner training, the method produces richer, more accurate captions and delivers significant gains on both text-to-image generation and zero-shot image classification tasks.

📝 Abstract
This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings: first, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency when the captioner's training data (e.g., from GPT) is unknown. In this paper, we study a principled approach, Altogether, based on the key idea of editing and re-aligning the existing alt-texts associated with images. To generate training data, we perform human annotation in which annotators start from the existing alt-text and re-align it to the image content over multiple rounds, thereby constructing captions with rich visual concepts. This differs from prior work that treats human annotation as a one-time description task based solely on the images and annotator knowledge. We then train a captioner on this data that generalizes the re-alignment process at scale. Our results show that Altogether leads to richer image captions that also improve text-to-image generation and zero-shot image classification tasks.
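The multi-round re-alignment process described in the abstract can be sketched as a simple loop. This is a toy illustration only: `realign` and `altogether_annotate` are hypothetical names, and the per-round edit (appending one visible concept missing from the caption) is a trivial stand-in for the human annotation rounds the paper actually uses.

```python
def realign(caption: str, image_concepts: set[str]) -> str:
    """One editing round: add the first visible concept the caption lacks.
    A placeholder for a human annotator correcting/enriching the text."""
    for concept in sorted(image_concepts):
        if concept not in caption:
            return caption.rstrip(".") + f", {concept}."
    return caption  # caption already covers all listed concepts


def altogether_annotate(alt_text: str, image_concepts: set[str],
                        rounds: int = 5) -> str:
    """Start from the existing alt-text and re-align it to the image
    content over multiple rounds, stopping once no edit is made."""
    caption = alt_text
    for _ in range(rounds):
        updated = realign(caption, image_concepts)
        if updated == caption:  # converged: no further re-alignment needed
            break
        caption = updated
    return caption


caption = altogether_annotate(
    alt_text="A dog.",
    image_concepts={"golden retriever", "red frisbee", "park lawn"},
)
print(caption)  # alt-text progressively enriched with grounded concepts
```

The key design point mirrored here is that annotation *edits* the alt-text rather than describing the image from scratch, so information already present in the metadata is preserved while missing visual concepts are added round by round.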
Problem

Research questions and friction points this paper is trying to address.

Image Captioning
Information Ignorance
Opaque Data Sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative Refinement
Human-in-the-Loop
Image Annotation Quality