Synthetic Visual Genome

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited accuracy of multimodal language models (MLMs) in reasoning over and generating visual relations (spatial, functional, interactional, and social), this paper introduces the SVG synthetic dataset and the SG-EDIT self-distillation framework. The authors propose a scene graph annotation pipeline that combines teacher-MLM relation completion with GPT-4o refinement, sidestepping the bottleneck of manual annotation. Instruction-tuning the ROBIN-3B model on the resulting densely annotated corpus (146K images, 5.6M relations over 2.6M objects) yields strong data efficiency: trained on fewer than 3M instances, ROBIN-3B outperforms similar-size models trained on over 300M instances and sets a new state of the art in referring expression comprehension with a score of 88.9.

📝 Abstract
Reasoning over visual relationships (spatial, functional, interactional, social, etc.) is considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generation remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships, capable of constructing high-quality dense scene graphs at scale. To train ROBIN, we curate SVG, a synthetic scene graph dataset built by completing the missing relations of selected objects in existing scene graphs using a teacher MLM and a carefully designed filtering process to ensure high quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o further refines ROBIN's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. Results show that our ROBIN-3B model, despite being trained on fewer than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models of up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.9, surpassing the previous best of 87.4. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.
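The SVG curation step described above — completing the missing relations of selected objects in an existing scene graph — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `SceneGraph` class and `missing_pairs` helper are hypothetical names, and in the actual pipeline a teacher MLM (plus a filtering stage) would fill in the predicates for the candidate pairs found here.

```python
from dataclasses import dataclass, field
from itertools import combinations

@dataclass
class SceneGraph:
    objects: list                                   # object names, e.g. ["person", "bench", "dog"]
    relations: dict = field(default_factory=dict)   # (subj_idx, obj_idx) -> predicate string

    def missing_pairs(self):
        """Object pairs with no annotated relation in either direction --
        the candidates a teacher MLM would be asked to complete."""
        return [
            (i, j) for i, j in combinations(range(len(self.objects)), 2)
            if (i, j) not in self.relations and (j, i) not in self.relations
        ]

sg = SceneGraph(["person", "bench", "dog"], {(0, 1): "sitting on"})
print(sg.missing_pairs())  # pairs still needing a relation: [(0, 2), (1, 2)]
```

The point of the sketch is that dense annotation is quadratic in the number of objects, which is why the paper delegates completion to a teacher model rather than human annotators.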
Problem

Research questions and friction points this paper is trying to address.

Enhancing precise reasoning over visual relationships in MLMs
Generating high-quality dense scene graphs at scale
Improving visual relationship understanding with refined data
Innovation

Methods, ideas, or system contributions that make the work stand out.

ROBIN: MLM instruction-tuned for dense scene graphs
SVG: synthetic dataset with filtered missing relations
SG-EDIT: self-distillation refines scene graphs via GPT-4o
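The SG-EDIT idea in the last bullet — a critic model prunes unlikely relations from ROBIN's predictions and proposes relevant missing ones — can be illustrated with a toy refinement pass. Everything here is a sketch under assumed names: `sg_edit_refine`, the judge interface, and `StubJudge` are hypothetical, and `StubJudge` merely stands in for the GPT-4o critic used in the paper.

```python
def sg_edit_refine(relations, judge):
    """One SG-EDIT-style refinement pass (sketch): the judge scores each
    predicted relation, unlikely ones are dropped, and any suggested
    additions not already present are merged in."""
    kept = [r for r in relations if judge.is_plausible(r)]
    return kept + [r for r in judge.suggest(kept) if r not in kept]

class StubJudge:
    """Toy stand-in for the GPT-4o critic (hypothetical interface)."""
    def is_plausible(self, rel):
        subj, pred, obj = rel
        return subj != obj                        # toy rule: reject self-relations
    def suggest(self, relations):
        return [("dog", "next to", "bench")]      # toy fixed suggestion

preds = [("person", "sitting on", "bench"), ("bench", "on", "bench")]
refined = sg_edit_refine(preds, StubJudge())
print(refined)
# → [('person', 'sitting on', 'bench'), ('dog', 'next to', 'bench')]
```

Calling this "self-distillation" fits the paper's framing: the refined graphs produced by the critic become new training data for ROBIN itself.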