🤖 AI Summary
Existing image captioning datasets rely solely on unstructured text, failing to explicitly encode compositional structure and relational semantics among entities. To address this, we propose Graph-based Captioning (GBC), a novel paradigm in which nodes represent entities, attributes, and relation phrases, while labeled edges explicitly model semantic connections—preserving linguistic flexibility while introducing hierarchical structure. We introduce the first graph-based annotation framework, enabling automatic construction of the large-scale GBC10M dataset (10 million samples). Moreover, we pioneer the use of graph structure both as a supervision signal in CLIP-style contrastive learning and as an intermediate representation for text-to-image generation. Combining multimodal large language models, object detection, and graph modeling, our approach achieves significant improvements across VQA, referring expression comprehension (REC), and captioning benchmarks. Experiments demonstrate that graph-structured representations enhance both fidelity and fine-grained controllability in text-to-image synthesis. Code and the GBC10M dataset are publicly released.
📝 Abstract
Humans describe complex scenes compositionally, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets, which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), that describes an image using a labeled graph structure with nodes of various types. The nodes in GBC are created through a two-stage process: first, identifying and describing entity nodes; second, linking these nodes by highlighting *compositions* and *relations* among them. Since *all* GBC nodes hold plain text descriptions, GBC retains the flexibility found in natural language, but can also encode hierarchical information in its edges. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and object detection models, by building a new dataset, GBC10M, that gathers GBC annotations for about 10M images of the CC12M dataset. Through CLIP training on GBC10M, we show that leveraging GBC node annotations -- particularly those in composition and relation nodes -- significantly boosts the model's performance across various benchmarks compared to when other annotations are used. To further explore the opportunities provided by GBC, we also investigate the use of GBC as middleware for text-to-image generation, and show the extra benefits of incorporating the graph structure in this task. Our code and datasets are released at https://github.com/apple/ml-gbc and https://huggingface.co/graph-based-captions.
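To make the idea concrete, here is a minimal sketch of what a GBC-style annotation could look like as a data structure: every node holds a plain-text description, and labeled edges link entities to the compositions and relations that mention them. The node kinds, field names, and the example caption below are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str   # assumed node types: "image", "entity", "composition", "relation"
    text: str   # plain-text description, retaining natural-language flexibility

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # node id -> Node
    edges: list = field(default_factory=list)   # (src id, dst id, edge label)

    def add_node(self, nid: str, kind: str, text: str) -> None:
        self.nodes[nid] = Node(kind, text)

    def add_edge(self, src: str, dst: str, label: str) -> None:
        self.edges.append((src, dst, label))

# Hypothetical example: annotating "a dog chasing a ball on the grass"
g = Graph()
g.add_node("img", "image", "a dog chasing a ball on the grass")
g.add_node("dog", "entity", "a brown dog in mid-run")
g.add_node("ball", "entity", "a red ball")
g.add_node("rel", "relation", "the dog chases the ball")
g.add_edge("img", "dog", "contains")
g.add_edge("img", "ball", "contains")
g.add_edge("rel", "dog", "subject")
g.add_edge("rel", "ball", "object")

print(len(g.nodes), len(g.edges))  # → 4 4
```

The key property the sketch illustrates is that hierarchy lives entirely in the edges: stripping them leaves ordinary captions, which is what lets GBC annotations feed both plain-text pipelines and structure-aware training.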