🤖 AI Summary
This work addresses the limitations of existing end-to-end scene graph generation methods, which suffer from low recall and significant bias due to insufficient structured reasoning and the challenges posed by sparse, long-tailed relationship distributions. To overcome these issues, we propose SGG-R³, a novel framework built around a three-stage structured reasoning pipeline that combines task-oriented chain-of-thought-guided supervised fine-tuning with reinforcement learning optimized via a group sequence policy strategy. Our approach introduces a relation augmentation strategy to alleviate data sparsity and designs a dual-granularity reward mechanism that incorporates frequency-adaptive weighting and semantic clustering to mitigate long-tailed distribution effects. Extensive experiments on two benchmark datasets demonstrate that SGG-R³ substantially outperforms current state-of-the-art methods, confirming its effectiveness and generalization capability in generating unbiased, high-coverage scene graphs.
📝 Abstract
Scene Graph Generation (SGG) structures visual scenes as graphs of objects and their relations. While Multimodal Large Language Models (MLLMs) have advanced end-to-end SGG, current methods are hindered by both a lack of task-specific structured reasoning and the challenges of sparse, long-tailed relation distributions, resulting in incomplete scene graphs with low recall and biased predictions. To address these issues, we introduce SGG-R$^{\rm 3}$, a structured reasoning framework that integrates task-specific chain-of-thought (CoT)-guided supervised fine-tuning (SFT) and reinforcement learning (RL) with group sequence policy optimization (GSPO), proceeding through three sequential stages to achieve end-to-end unbiased scene graph generation. During the SFT phase, we propose a relation augmentation strategy that leverages an MLLM, with candidates refined via embedding-similarity filtering, to alleviate relation sparsity. Subsequently, during RL, a stage-aligned reward scheme optimizes the procedural reasoning. Specifically, we propose a novel dual-granularity reward that integrates fine-grained and coarse-grained relation rewards, simultaneously mitigating the long-tail issue via frequency-based adaptive weighting of predicates and improving relation coverage through semantic clustering. Experiments on two benchmarks show that SGG-R$^{\rm 3}$ achieves superior performance compared to existing methods, demonstrating the effectiveness and generalization of the framework.
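The dual-granularity reward described above can be illustrated with a minimal sketch. This is not the paper's implementation: the predicate counts, cluster assignments, weighting exponent `alpha`, and the mixing weights `w_fine`/`w_coarse` are all hypothetical placeholders, shown only to make the frequency-adaptive and cluster-based components concrete.

```python
from collections import Counter

# Hypothetical training-set predicate frequencies (not from the paper):
# "on" is a head predicate, "eating" sits in the long tail.
predicate_counts = Counter({"on": 900, "holding": 120, "riding": 30, "eating": 15})

# Hypothetical semantic clusters of predicates (assumed, for illustration).
clusters = {"on": "spatial", "holding": "contact", "riding": "contact", "eating": "contact"}

def frequency_weight(pred, counts, alpha=0.5):
    """Inverse-frequency weight: rarer predicates earn larger rewards."""
    freq = counts[pred] / sum(counts.values())
    return freq ** -alpha  # alpha controls how aggressively the tail is boosted

def dual_granularity_reward(pred, gold, w_fine=1.0, w_coarse=0.5):
    """Fine-grained term: exact predicate match, scaled by inverse frequency.
    Coarse-grained term: partial credit when the prediction falls in the
    gold predicate's semantic cluster."""
    fine = frequency_weight(gold, predicate_counts) if pred == gold else 0.0
    coarse = 1.0 if clusters.get(pred) == clusters.get(gold) else 0.0
    return w_fine * fine + w_coarse * coarse
```

Under this toy setup, correctly predicting the rare predicate "riding" earns a larger reward than correctly predicting the frequent "on", and a same-cluster miss (e.g. "holding" for "riding") still receives partial credit, which is the intended debiasing behavior.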