Measuring Image-Relation Alignment: Reference-Free Evaluation of VLMs and Synthetic Pre-training for Open-Vocabulary Scene Graph Generation

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary scene graph generation (SGG) suffers from two key bottlenecks: narrow benchmark vocabularies that make evaluation inefficient, and reliance on low-quality weakly supervised pre-training data. To address these, the paper proposes a reference-free open-vocabulary relation alignment metric, described as the first capable of evaluating relational semantic consistency without ground-truth relation annotations. Alongside the metric, it introduces a region-specific prompt tuning framework that leverages vision-language models (VLMs) to generate high-fidelity, diverse synthetic relation data, replacing conventional weak supervision. The method integrates region-level feature alignment with synthetic-data-driven pre-training. Experiments indicate that the proposed metric is fairer and more robust than existing alternatives, while the synthetic data significantly improves model performance and cross-category generalization in open-vocabulary SGG.
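To make the idea of reference-free relation evaluation concrete, the sketch below ranks candidate ⟨subject, predicate, object⟩ captions for an image region purely by embedding similarity, with no ground-truth relation label consulted. This is an illustrative toy, not the paper's metric: the `toy_text_encoder` and the simulated region embedding are placeholders for the image and text towers of a real VLM such as CLIP.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_relations(region_emb, candidate_captions, encode_text):
    """Reference-free scoring: rank candidate relation captions by their
    similarity to the region embedding, without any ground-truth annotation."""
    scores = {c: cosine(region_emb, encode_text(c)) for c in candidate_captions}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# --- toy stand-ins (a real pipeline would use a VLM's text/image encoders) ---
_cache = {}
def toy_text_encoder(caption, dim=64):
    # Deterministic pseudo-embedding per caption; placeholder for a text tower.
    if caption not in _cache:
        seed = abs(hash(caption)) % (2**32)
        _cache[caption] = np.random.default_rng(seed).standard_normal(dim)
    return _cache[caption]

candidates = ["person riding horse", "person feeding horse", "horse riding person"]

# Simulate a region crop whose embedding lies close to the true caption.
rng = np.random.default_rng(0)
region = toy_text_encoder("person riding horse") + 0.1 * rng.standard_normal(64)

ranking = rank_relations(region, candidates, toy_text_encoder)
print(ranking[0][0])  # highest-scoring caption
```

Because the simulated region embedding is a noisy copy of the true caption's embedding, the matching caption ranks first; in the real setting, the quality of this ranking depends entirely on how well the VLM aligns region crops with relational text.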

📝 Abstract
Scene Graph Generation (SGG) encodes visual relationships between objects in images as graph structures. Thanks to advances in Vision-Language Models (VLMs), the task of Open-Vocabulary SGG has recently been proposed, where models are evaluated on their ability to learn a wide and diverse range of relations. Current SGG benchmarks, however, possess a very limited vocabulary, making the evaluation of open-vocabulary models inefficient. In this paper, we propose a new reference-free metric to fairly evaluate the open-vocabulary capabilities of VLMs for relation prediction. Another limitation of Open-Vocabulary SGG is its reliance on poor-quality weakly supervised data for pre-training. We also propose a new solution for quickly generating high-quality synthetic data through region-specific prompt tuning of VLMs. Experimental results show that pre-training on this new data can benefit the generalization capabilities of Open-Vocabulary SGG models.
Problem

Research questions and friction points this paper is trying to address.

Evaluating open-vocabulary relation prediction in VLMs
Addressing limited vocabulary in SGG benchmarks
Generating high-quality synthetic data for pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference-free metric for VLM evaluation
Synthetic data generation via prompt tuning
Region-specific VLM tuning for SGG