🤖 AI Summary
Existing image captioning methods suffer from two critical limitations: (1) they disregard the original alt-text during caption regeneration, and (2) they rely on opaque large language models (e.g., GPT) for synthetic data generation, compromising interpretability and transparency. To address these issues, we propose an *alt-text realignment editing* paradigm that transforms single-pass image understanding into an iterative, human-in-the-loop text–image semantic alignment process via multi-round collaborative annotation. This yields a high-quality caption dataset rich in grounded visual concepts. Crucially, our approach introduces the first synthetic data generation mechanism explicitly conditioned on authentic alt-text, substantially enhancing data transparency and model interpretability. When applied to end-to-end captioner training, our method produces richer, more accurate captions and delivers significant performance gains on both text-to-image generation and zero-shot image classification tasks.
📝 Abstract
This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings. First, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency when the captioner's training data (e.g., from GPT) is unknown. In this paper, we study a principled approach, Altogether, based on the key idea of editing and re-aligning the existing alt-texts associated with the images. To generate training data, we perform human annotation in which annotators start from the existing alt-text and re-align it to the image content over multiple rounds, thereby constructing captions with rich visual concepts. This differs from prior work that treats human annotation as a one-time description task based solely on the image and the annotator's knowledge. We then train a captioner on this data that generalizes the alt-text re-alignment process at scale. Our results show that Altogether leads to richer image captions and also improves text-to-image generation and zero-shot image classification.
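The multi-round re-alignment process described above can be sketched as a simple loop. This is a toy illustration, not the paper's actual annotation tooling: the image is stood in for by a hypothetical set of ground-truth visual concepts, the alt-text by an initial (possibly wrong or incomplete) concept set, and each round edits the caption by dropping unsupported concepts and adding a bounded number of missing ones, mimicking incremental annotator refinement.

```python
# Toy sketch of multi-round alt-text re-alignment (illustrative only).
# The function names, the concept-set representation, and the example
# data are all hypothetical assumptions, not the paper's interface.

def realign_round(caption_concepts: set[str], image_concepts: set[str],
                  max_additions: int = 2) -> set[str]:
    """One annotation round: remove hallucinated concepts, then add a
    few grounded ones (annotators refine incrementally, not all at once)."""
    kept = caption_concepts & image_concepts       # drop misaligned text
    missing = sorted(image_concepts - kept)        # concepts still absent
    return kept | set(missing[:max_additions])     # add a bounded number

def altogether_realign(alt_text_concepts: set[str],
                       image_concepts: set[str],
                       max_rounds: int = 5) -> tuple[set[str], int]:
    """Iterate rounds until the caption covers the image content."""
    caption = set(alt_text_concepts)
    for rnd in range(1, max_rounds + 1):
        caption = realign_round(caption, image_concepts)
        if caption == image_concepts:              # fully re-aligned
            return caption, rnd
    return caption, max_rounds

# Hypothetical example: the alt-text is partly wrong ("stock photo")
# and incomplete, while the image shows a dog on a beach at sunset.
alt_text = {"dog", "stock photo"}
image = {"dog", "beach", "sunset", "waves"}
final_caption, rounds_used = altogether_realign(alt_text, image)
print(sorted(final_caption), rounds_used)
# → ['beach', 'dog', 'sunset', 'waves'] 2
```

In this sketch, the trained captioner would play the role of `realign_round` at scale, editing noisy alt-text toward the image rather than describing the image from scratch.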