🤖 AI Summary
Scene Graph Generation (SGG) suffers from severe training bias: models collapse diverse relations into frequent, generic ones (e.g., “walk on / sit on / lay on” into “on”) and let dominant predicates such as “near” crowd out fine-grained spatial relations (e.g., “behind / in front of”), which weakens structural reasoning in downstream tasks such as VQA. Conventional debiasing methods cannot distinguish beneficial contextual priors from harmful long-tailed bias. This paper proposes an SGG framework based on causal inference rather than conventional likelihood: it builds a causal graph for SGG, trains it in the usual biased way, and then applies counterfactual reasoning to the trained graph to estimate the effect of the harmful bias and remove it. The Total Direct Effect (TDE) serves as the final debiased predicate score. The framework is model-agnostic and can be applied to any SGG model. On the Visual Genome benchmark, using the proposed Scene Graph Diagnosis toolkit with several prevailing models, the authors report significant improvements over previous state-of-the-art methods.
📝 Abstract
Today’s scene graph generation (SGG) task is still far from practical, mainly because of severe training bias, e.g., collapsing diverse "human walk on / sit on / lay on beach" into "human on beach". Given such SGG, downstream tasks such as VQA can hardly infer better scene structures than a mere bag of objects. However, debiasing in SGG is not trivial because traditional debiasing methods cannot distinguish between good and bad bias, e.g., a good context prior ("person read book" rather than "eat") and a bad long-tailed bias ("near" dominating "behind / in front of"). In this paper, we present a novel SGG framework based on causal inference rather than the conventional likelihood. We first build a causal graph for SGG and perform traditional biased training with the graph. Then, we propose to draw the counterfactual causality from the trained graph to infer the effect of the bad bias, which should be removed. In particular, we use Total Direct Effect (TDE) as the final predicate score for unbiased SGG. Note that our framework is agnostic to any SGG model and thus can be widely applied by anyone in the community who seeks unbiased predictions. By using the proposed Scene Graph Diagnosis toolkit on the SGG benchmark Visual Genome and several prevailing models, we observed significant improvements over the previous state-of-the-art methods.
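To make the TDE idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy predictor whose predicate logits are the sum of a visual term and a context prior, and it assumes the counterfactual is formed by blanking the visual features while keeping the context fixed. The names `predicate_logits`, `tde_scores`, `w_visual`, and `context_bias`, and all numbers, are hypothetical.

```python
import numpy as np

PREDICATES = ["on", "sitting on", "standing on"]


def predicate_logits(visual_feat, context_bias, w_visual):
    """Toy predictor: visual evidence plus a context prior (assumed additive)."""
    return w_visual @ visual_feat + context_bias


def tde_scores(visual_feat, context_bias, w_visual, masked_feat=None):
    """Factual logits minus counterfactual logits with the visual input blanked."""
    if masked_feat is None:
        # Assumption: a zero vector stands in for the "wiped out" visual input.
        masked_feat = np.zeros_like(visual_feat)
    factual = predicate_logits(visual_feat, context_bias, w_visual)
    counterfactual = predicate_logits(masked_feat, context_bias, w_visual)
    return factual - counterfactual


# Hypothetical weights: the visual feature supports "sitting on" most strongly.
w_visual = np.array([[0.5, 0.0],
                     [0.0, 2.0],
                     [0.0, 0.2]])
visual_feat = np.array([1.0, 1.0])
# Hypothetical long-tailed prior that heavily favours the generic "on".
context_bias = np.array([3.0, 0.5, 0.2])

biased = predicate_logits(visual_feat, context_bias, w_visual)
debiased = tde_scores(visual_feat, context_bias, w_visual)

print("biased prediction:  ", PREDICATES[int(np.argmax(biased))])
print("debiased prediction:", PREDICATES[int(np.argmax(debiased))])
```

In this toy, the biased logits pick the generic "on", while the subtracted score picks "sitting on". Because the toy is purely additive, the subtraction cancels the whole prior; it therefore cannot show how a real, nonlinear SGG model would retain the beneficial context that the abstract describes.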