CaptionQA: Is Your Caption as Useful as the Image Itself?

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing image-caption evaluation methods overlook a fundamental question: can generated descriptions effectively substitute for images in real downstream tasks? To address this, the authors propose CaptionQA, the first task-oriented benchmark for assessing the practical utility of image captions. CaptionQA features a fine-grained taxonomy spanning four domains (natural scenes, documents, e-commerce, and embodied AI) and comprises 33,027 densely human-annotated multiple-choice questions, generated and refined with LLM assistance. Its core idea is to quantify caption utility along two dimensions: whether a caption preserves the image's information, and whether a downstream LLM can actually use it. Empirical evaluation reveals that state-of-the-art multimodal large models drop by up to 32% when answering from captions instead of images, exposing blind spots in conventional metrics such as BLEU and CIDEr. The authors open-source a modular, extensible evaluation framework, grounding caption quality assessment in functional utility.

📝 Abstract
Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: can captions stand in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains--Natural, Document, E-commerce, and Embodied AI--each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are usable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between image utility and caption utility. Notably, models that are nearly identical on traditional image-QA benchmarks drop by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.
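The evaluation protocol above can be sketched in a few lines: an answerer sees only the caption text, never the image, and must pick a multiple-choice option; accuracy then serves as the caption's utility score. This is a minimal illustration, not CaptionQA's actual pipeline; `answer_fn` stands in for a text-only LLM call, and here a trivial keyword matcher is substituted so the sketch runs without any API. All function names and the sample item are hypothetical.

```python
def score_caption_utility(items, answer_fn):
    """items: list of dicts with 'caption', 'question', 'options', 'answer'.
    Returns the fraction of questions answered correctly from the caption alone."""
    correct = 0
    for item in items:
        # The answerer only ever sees caption text, never pixels.
        choice = answer_fn(item["caption"], item["question"], item["options"])
        if choice == item["answer"]:
            correct += 1
    return correct / len(items) if items else 0.0

def keyword_answerer(caption, question, options):
    # Toy stand-in for an LLM: pick the option with the most word overlap
    # with the caption. A real run would prompt a text-only LLM instead.
    caption_words = set(caption.lower().split())
    return max(options, key=lambda opt: len(set(opt.lower().split()) & caption_words))

items = [
    {"caption": "A red bicycle leaning against a brick wall.",
     "question": "What color is the bicycle?",
     "options": ["red bicycle", "blue bicycle"],
     "answer": "red bicycle"},
]
print(score_caption_utility(items, keyword_answerer))  # 1.0 on this toy item
```

Comparing this score against the same LLM's accuracy when given the image directly would expose the image-caption utility gap the paper measures.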
Problem

Research questions and friction points this paper is trying to address.

Evaluating whether captions can effectively replace images in downstream tasks
Measuring caption quality through task performance using LLMs
Identifying utility gaps between images and their captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utility-based benchmark evaluates caption task performance
Domain-dependent framework with fine-grained taxonomies
Multiple-choice questions probe caption utility via LLM responses