🤖 AI Summary
This work addresses a challenge in open-vocabulary scene graph generation: relationship predictions are often biased by linguistic priors or object co-occurrence statistics and lack sufficient visual grounding. To mitigate this, the authors propose a paradigm based on counterfactual relationship verification. The method first generates open-vocabulary relationship candidates, then uses a relation-conditioned evidence encoder to extract soft visual evidence (such as support, contact, and containment), and applies a counterfactual perturbation mechanism to assess each relationship's dependence on critical visual cues. Combined with contradiction-aware predicate learning and graph-level preference optimization, the approach consistently improves recall on standard SGG benchmarks, improves generalization to unseen predicates, and yields more reliable, interpretable, and visually grounded scene graphs.
📝 Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen-predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
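The counterfactual test described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`mask`, `counterfactual_check`), the thresholds `drop_margin` and `stability_tol`, the linear scorer, and all numeric values are assumptions chosen only to show the idea that a grounded relation score should fall when necessary evidence is removed and stay roughly constant when irrelevant evidence is perturbed.

```python
from typing import Callable, Dict, Iterable

# Hypothetical representation: one activation in [0, 1] per soft evidence basis.
Evidence = Dict[str, float]


def mask(evidence: Evidence, keys: Iterable[str]) -> Evidence:
    """Return a copy of the evidence with the given bases zeroed out."""
    masked = dict(evidence)
    for key in keys:
        masked[key] = 0.0
    return masked


def counterfactual_check(
    score: Callable[[Evidence], float],
    evidence: Evidence,
    necessary: Iterable[str],
    irrelevant: Iterable[str],
    drop_margin: float = 0.2,     # assumed minimum drop when necessary evidence is removed
    stability_tol: float = 0.05,  # assumed maximum shift under irrelevant perturbation
) -> bool:
    """Accept a relation only if its score depends on the right evidence."""
    base = score(evidence)
    drop = base - score(mask(evidence, necessary))
    shift = abs(base - score(mask(evidence, irrelevant)))
    return drop >= drop_margin and shift <= stability_tol


# Toy linear scorer for the predicate "on": relies on support and contact only.
ON_WEIGHTS = {"support": 0.6, "contact": 0.4, "containment": 0.0, "motion": 0.0}


def score_on(evidence: Evidence) -> float:
    return sum(ON_WEIGHTS.get(name, 0.0) * value for name, value in evidence.items())


def score_prior_only(evidence: Evidence) -> float:
    """A scorer driven purely by a language prior: it ignores visual evidence."""
    return 0.9


evidence = {"support": 0.9, "contact": 0.8, "containment": 0.1, "motion": 0.3}

grounded = counterfactual_check(score_on, evidence, ["support", "contact"], ["motion"])
prior_only = counterfactual_check(score_prior_only, evidence, ["support", "contact"], ["motion"])

print("evidence-based scorer passes:", grounded)
print("prior-only scorer passes:", prior_only)
```

In this toy setting the evidence-based scorer passes the check, while the prior-only scorer fails because its score does not move when support and contact are removed. A learned verifier would presumably operate on image features rather than hand-set activations; the sketch only mirrors the accept/reject logic.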