🤖 AI Summary
This paper addresses natural language description generation for image collections. ImageSet2Text is an iterative, set-level description framework built on vision-language foundation models: it extracts salient concepts from image subsets via visual question answering (VQA) chains, inspired by concept bottleneck models (CBMs); encodes these concepts into a structured concept graph; integrates an external knowledge graph to support semantic reasoning; and applies CLIP-based cross-modal validation to refine semantic fidelity. The main contributions are: (1) an interpretable, fine-grained paradigm for set-level description; (2) new datasets and a benchmark for large-scale group image captioning; and (3) an extensive evaluation of description accuracy, completeness, readability, and overall quality against existing vision-language models, supporting accurate and traceable text generation.
📝 Abstract
We introduce ImageSet2Text, a novel approach that leverages vision-language foundation models to automatically create natural language descriptions of image sets. Inspired by concept bottleneck models (CBMs) and based on visual question answering (VQA) chains, ImageSet2Text iteratively extracts key concepts from image subsets, encodes them into a structured graph, and refines insights using an external knowledge graph and CLIP-based validation. This iterative process enhances interpretability and enables accurate and detailed set-level summarization. Through extensive experiments, we evaluate ImageSet2Text's descriptions on accuracy, completeness, readability, and overall quality, benchmarking it against existing vision-language models and introducing new datasets for large-scale group image captioning.
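To make the described pipeline concrete, the sketch below outlines the iterative loop from the abstract: VQA-chain concept extraction on image subsets, expansion via an external knowledge graph, CLIP-based validation, and accumulation into a concept graph. It is a minimal illustration under assumed interfaces, not the authors' implementation; `vqa_model.answer`, `knowledge_graph.neighbors`, and `clip_model.similarity` are hypothetical placeholders, and subset sampling and the final verbalization step are simplified.

```python
# Minimal, illustrative sketch of the iterative set-level description loop.
# All model/graph interfaces here are hypothetical placeholders, not the paper's API.

from dataclasses import dataclass, field


@dataclass
class ConceptGraph:
    """Structured graph of concepts extracted from the image set."""
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)

    def add(self, concept, parent=None):
        self.nodes.add(concept)
        if parent is not None:
            self.edges.add((parent, concept))


def describe_image_set(images, vqa_model, clip_model, knowledge_graph,
                       num_iterations=5, subset_size=8, score_threshold=0.25):
    """Iteratively build a concept graph for an image set and verbalize it."""
    graph = ConceptGraph()
    frontier = [None]  # concepts to expand; None seeds the first open-ended question

    for _ in range(num_iterations):
        next_frontier = []
        for parent in frontier:
            subset = images[:subset_size]  # in practice, a sampled subset

            # VQA chain: ask a question conditioned on the parent concept.
            question = ("What is common across these images?" if parent is None
                        else f"What kind of {parent} appears in these images?")
            # Assumed to return a list of candidate concept strings.
            candidates = vqa_model.answer(subset, question)

            # Expand candidates with related concepts from an external knowledge graph.
            candidates += knowledge_graph.neighbors(candidates)

            # CLIP-based validation: keep only concepts that align with the images.
            for concept in candidates:
                score = clip_model.similarity(subset, concept)
                if score >= score_threshold:
                    graph.add(concept, parent)
                    next_frontier.append(concept)

        frontier = next_frontier

    # Simplified verbalization: a real system would generate fluent text from the graph.
    return ", ".join(sorted(graph.nodes))
```

The loop structure is the point of the sketch: each iteration refines the concept graph by asking follow-up questions about previously validated concepts, which is what allows the description to grow in detail while staying grounded in the images.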