CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a reliability problem in open-vocabulary scene graph generation: predicted relations are often driven by language priors or object co-occurrence statistics rather than grounded visual evidence. The authors propose a paradigm based on counterfactual relation verification. The method first generates open-vocabulary relation candidates, then uses a relation-conditioned evidence encoder to extract soft visual evidence (such as support, contact, and containment), and finally applies counterfactual perturbations to test whether each relation's score depends on the necessary visual cues. Combined with contradiction-aware predicate learning and graph-level preference optimization, the approach is reported to improve recall-based metrics consistently across conventional, open-vocabulary, and panoptic SGG benchmarks, to generalize better to unseen predicates, and to yield more reliable, interpretable, and visually grounded scene graphs.

📝 Abstract

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
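
The counterfactual test described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the dictionary-based evidence representation, the `ablate` and `counterfactual_check` helpers, the toy scorers, and the `drop_margin` and `stable_tol` thresholds are all assumptions made for illustration. The idea shown is only the two-sided check: a relation counts as grounded if its score falls when necessary evidence is removed and stays roughly unchanged when irrelevant evidence is removed.

```python
from typing import Callable, Dict, Set

# Hypothetical representation: evidence basis name -> strength in [0, 1].
Evidence = Dict[str, float]


def ablate(evidence: Evidence, bases: Set[str]) -> Evidence:
    """Return a copy of the evidence with the given bases zeroed out."""
    return {k: (0.0 if k in bases else v) for k, v in evidence.items()}


def counterfactual_check(
    score_fn: Callable[[Evidence], float],
    evidence: Evidence,
    necessary: Set[str],
    irrelevant: Set[str],
    drop_margin: float = 0.2,   # assumed threshold, not from the paper
    stable_tol: float = 0.05,   # assumed threshold, not from the paper
) -> dict:
    """Test whether a relation score depends on necessary evidence only."""
    base = score_fn(evidence)
    # Score should decrease when necessary evidence is removed.
    drop = base - score_fn(ablate(evidence, necessary))
    # Score should stay stable when irrelevant evidence is removed.
    drift = abs(base - score_fn(ablate(evidence, irrelevant)))
    grounded = drop >= drop_margin and drift <= stable_tol
    return {"base": base, "drop": drop, "drift": drift, "grounded": grounded}


# Toy scorer for a predicate like "resting on" (weights are made up).
WEIGHTS = {"support": 0.6, "contact": 0.4}


def evidence_scorer(evidence: Evidence) -> float:
    return sum(w * evidence.get(k, 0.0) for k, w in WEIGHTS.items())


def prior_only_scorer(evidence: Evidence) -> float:
    # Mimics a prediction driven purely by a language prior:
    # the score ignores the visual evidence entirely.
    return 0.9


evidence = {"support": 0.9, "contact": 0.8, "motion": 0.3}
necessary = {"support", "contact"}
irrelevant = {"motion"}

grounded_result = counterfactual_check(evidence_scorer, evidence, necessary, irrelevant)
prior_result = counterfactual_check(prior_only_scorer, evidence, necessary, irrelevant)
```

In this toy setting the evidence-based scorer passes the check, while the prior-only scorer fails it because its score does not move when the necessary evidence is removed. That failure mode is the kind of ungrounded prediction the abstract says verification is meant to catch.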

Problem

Research questions and friction points this paper is trying to address.

open-vocabulary scene graph generation

relation grounding

visual evidence

language priors

counterfactual reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

counterfactual verification

open-vocabulary scene graph generation

visual grounding