CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

📅 2026-04-24
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a reliability problem in open-vocabulary scene graph generation: relationship predictions are often driven by linguistic priors or object co-occurrence statistics rather than by visual evidence. The authors propose a paradigm based on counterfactual relation verification. The method first generates open-vocabulary relationship candidates, then uses a relation-conditioned evidence encoder to extract soft visual evidence (such as support, contact, and containment), and finally applies counterfactual perturbations to test whether each relationship depends on its critical visual cues. Combined with contradiction-aware predicate learning and graph-level preference optimization, the approach consistently improves recall-based metrics on conventional, open-vocabulary, and panoptic SGG benchmarks, improves generalization to unseen predicates, and yields more reliable, interpretable, and visually grounded scene graphs.

📝 Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
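
The counterfactual test described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the paper's verifier is a learned model whose details are not given here. The function `counterfactual_check`, the `margin` and `tolerance` thresholds, the dictionary representation of evidence, and both toy scorers are assumptions made purely for illustration. The sketch checks two things: the relation score should drop when the necessary evidence bases are removed, and it should stay roughly unchanged when only irrelevant bases are removed.

```python
from typing import Callable, Dict, Set

# Hypothetical representation: evidence basis name -> strength in [0, 1].
Evidence = Dict[str, float]


def counterfactual_check(
    score_fn: Callable[[Evidence], float],
    evidence: Evidence,
    necessary: Set[str],
    margin: float = 0.1,      # assumed minimum drop when necessary evidence is removed
    tolerance: float = 0.05,  # assumed maximum drift under irrelevant perturbation
) -> Dict[str, object]:
    """Compare a relation score before and after two counterfactual edits."""
    base = score_fn(evidence)
    # Counterfactual 1: zero out the bases the predicate is assumed to need.
    without_necessary = {k: (0.0 if k in necessary else v) for k, v in evidence.items()}
    # Counterfactual 2: zero out everything else (an irrelevant perturbation).
    without_irrelevant = {k: (v if k in necessary else 0.0) for k, v in evidence.items()}
    drop = base - score_fn(without_necessary)
    drift = abs(base - score_fn(without_irrelevant))
    return {"drop": drop, "drift": drift, "grounded": drop >= margin and drift <= tolerance}


def evidence_scorer(e: Evidence) -> float:
    # Toy scorer for a predicate like "sitting on": depends on support and contact.
    return 0.6 * e.get("support", 0.0) + 0.4 * e.get("contact", 0.0)


def prior_scorer(e: Evidence) -> float:
    # Toy scorer that ignores the image entirely, mimicking a pure language prior.
    return 0.8


evidence = {"support": 0.9, "contact": 0.8, "motion": 0.3}
necessary = {"support", "contact"}

grounded_result = counterfactual_check(evidence_scorer, evidence, necessary)
prior_result = counterfactual_check(prior_scorer, evidence, necessary)

print(grounded_result["grounded"])  # True: score collapses without support/contact
print(prior_result["grounded"])     # False: score never reacts to the evidence
```

The second scorer shows the failure mode the abstract targets: a relation that looks plausible but whose score does not change when the supporting evidence is removed would be flagged as ungrounded under this check.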
Problem

Research questions and friction points this paper is trying to address.

open-vocabulary scene graph generation
relation grounding
visual evidence
language priors
counterfactual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

counterfactual verification
open-vocabulary scene graph generation
visual grounding
evidence-based reasoning
relation verification