🤖 AI Summary
Existing image captioning methods suffer from two critical limitations: (1) they disregard the original alt-text during caption regeneration, and (2) they rely on opaque large language models (e.g., GPT) for synthetic data generation, compromising interpretability and transparency. To address these issues, we propose an *alt-text realignment editing* paradigm that transforms single-pass image understanding into an iterative, human-in-the-loop text–image semantic alignment process via multi-round collaborative annotation. This yields a high-quality caption dataset rich in grounded visual concepts. Crucially, our approach introduces the first synthetic data generation mechanism explicitly conditioned on authentic alt-text, substantially enhancing data transparency and model interpretability. When applied to end-to-end captioner training, our method produces richer, more accurate captions and delivers significant performance gains on both text-to-image generation and zero-shot image classification tasks.
📝 Abstract
This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings. First, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency when the captioner's training data (e.g., from GPT) is unknown. In this paper, we study a principled approach, Altogether, based on the key idea of editing and re-aligning the existing alt-texts associated with the images. To generate training data, we perform human annotation in which annotators start from the existing alt-text and re-align it to the image content over multiple rounds, thereby constructing captions with rich visual concepts. This differs from prior work that treats human annotation as a one-time description task based solely on the image and the annotator's knowledge. We then train a captioner on this data that generalizes the alt-text re-alignment process at scale. Our results show that Altogether leads to richer image captions and also improves text-to-image generation and zero-shot image classification.
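The multi-round re-alignment process described above can be sketched as a simple loop. This is a toy illustration, not the paper's actual annotation tooling: the image is stood in for by a hypothetical set of ground-truth visual concepts, the alt-text by an initial (possibly wrong or incomplete) concept set, and each round edits the caption by dropping unsupported concepts and adding a bounded number of missing ones, mimicking incremental annotator refinement.

```python
# Toy sketch of multi-round alt-text re-alignment (illustrative only).
# The function names, the concept-set representation, and the example
# data are all hypothetical assumptions, not the paper's interface.

def realign_round(caption_concepts: set[str], image_concepts: set[str],
                  max_additions: int = 2) -> set[str]:
    """One annotation round: remove hallucinated concepts, then add a
    few grounded ones (annotators refine incrementally, not all at once)."""
    kept = caption_concepts & image_concepts       # drop misaligned text
    missing = sorted(image_concepts - kept)        # concepts still absent
    return kept | set(missing[:max_additions])     # add a bounded number

def altogether_realign(alt_text_concepts: set[str],
                       image_concepts: set[str],
                       max_rounds: int = 5) -> tuple[set[str], int]:
    """Iterate rounds until the caption covers the image content."""
    caption = set(alt_text_concepts)
    for rnd in range(1, max_rounds + 1):
        caption = realign_round(caption, image_concepts)
        if caption == image_concepts:              # fully re-aligned
            return caption, rnd
    return caption, max_rounds

# Hypothetical example: the alt-text is partly wrong ("stock photo")
# and incomplete, while the image shows a dog on a beach at sunset.
alt_text = {"dog", "stock photo"}
image = {"dog", "beach", "sunset", "waves"}
final_caption, rounds_used = altogether_realign(alt_text, image)
print(sorted(final_caption), rounds_used)
# → ['beach', 'dog', 'sunset', 'waves'] 2
```

In this sketch, the trained captioner would play the role of `realign_round` at scale, editing noisy alt-text toward the image rather than describing the image from scratch.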