🤖 AI Summary
This work addresses the limitations of existing end-to-end scene graph generation methods, which suffer from low recall and significant bias due to insufficient structured reasoning and the challenges posed by sparse, long-tailed relationship distributions. To overcome these issues, we propose SGG-R³, a novel framework built around a three-stage structured reasoning pipeline that combines task-oriented chain-of-thought-guided supervised fine-tuning with reinforcement learning optimized via a group sequence policy strategy. Our approach introduces a relation augmentation strategy to alleviate data sparsity and designs a dual-granularity reward mechanism that incorporates frequency-adaptive weighting and semantic clustering to mitigate long-tailed distribution effects. Extensive experiments on two benchmark datasets demonstrate that SGG-R³ substantially outperforms current state-of-the-art methods, confirming its effectiveness and generalization capability in generating unbiased, high-coverage scene graphs.
📝 Abstract
Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs with low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), proceeding through three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy that leverages an MLLM, with candidates refined via embedding-similarity filtering, to alleviate relation sparsity. Subsequently, during RL, a stage-aligned reward scheme optimizes the procedural reasoning. Specifically, we propose a novel dual-granularity reward that integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.
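The dual-granularity reward described above can be illustrated with a minimal sketch. This is not the paper's implementation: the predicate counts, cluster assignments, weighting exponent `alpha`, and the mixing weights `w_fine`/`w_coarse` are all hypothetical placeholders, shown only to make the frequency-adaptive and cluster-based components concrete.

```python
from collections import Counter

# Hypothetical training-set predicate frequencies (not from the paper):
# "on" is a head predicate, "eating" sits in the long tail.
predicate_counts = Counter({"on": 900, "holding": 120, "riding": 30, "eating": 15})

# Hypothetical semantic clusters of predicates (assumed, for illustration).
clusters = {"on": "spatial", "holding": "contact", "riding": "contact", "eating": "contact"}

def frequency_weight(pred, counts, alpha=0.5):
    """Inverse-frequency weight: rarer predicates earn larger rewards."""
    freq = counts[pred] / sum(counts.values())
    return freq ** -alpha  # alpha controls how aggressively the tail is boosted

def dual_granularity_reward(pred, gold, w_fine=1.0, w_coarse=0.5):
    """Fine-grained term: exact predicate match, scaled by inverse frequency.
    Coarse-grained term: partial credit when the prediction falls in the
    gold predicate's semantic cluster."""
    fine = frequency_weight(gold, predicate_counts) if pred == gold else 0.0
    coarse = 1.0 if clusters.get(pred) == clusters.get(gold) else 0.0
    return w_fine * fine + w_coarse * coarse
```

Under this toy setup, correctly predicting the rare predicate "riding" earns a larger reward than correctly predicting the frequent "on", and a same-cluster miss (e.g. "holding" for "riding") still receives partial credit, which is the intended debiasing behavior.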