DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement

πŸ“… 2025-06-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing scene graph parsers operate on single-sentence inputs and fail to resolve cross-sentential coreference in multi-sentence visual descriptions, leading to fragmented graph structures and degraded performance in downstream Vision-Language Model (VLM) tasks. To address this, the authors formally define the novel task of discourse-level text scene graph parsing. They propose DiscoSG-Refiner, a two-stage iterative graph refinement framework built on two Flan-T5-Base models following a "draft generation + iterative editing" paradigm: one model drafts a base graph from the full description, and a second model repeatedly proposes graph edits, jointly improving discourse structure modeling and graph topology. They also introduce DiscoSG-DS, the first large-scale discourse-level image-text scene graph dataset. Experiments demonstrate an approximately 30% improvement in SPICE score over the best baseline, an 86× inference speedup relative to GPT-4, and consistent gains on downstream tasks such as discourse-level caption evaluation and hallucination detection.
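The "draft generation + iterative editing" loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `draft_graph` and `propose_edits` are hypothetical stand-ins for the two fine-tuned Flan-T5-Base models, and the triple/edit formats are assumptions for clarity.

```python
# Hedged sketch of DiscoSG-Refiner's two-stage refinement paradigm.
# Both model calls are stubbed; in the paper they are separate
# fine-tuned Flan-T5-Base models.

Triple = tuple[str, str, str]  # (subject, predicate, object)

def draft_graph(caption: str) -> set[Triple]:
    """Drafter PLM: map the full multi-sentence caption to an
    initial triple set in one pass. Stubbed for illustration."""
    return {("dog", "chasing", "ball")}

def propose_edits(caption: str, graph: set[Triple]) -> list[tuple[str, Triple]]:
    """Refiner PLM: compare caption against the current graph and
    emit edit operations. Stubbed: adds one missing triple, then
    reports convergence with an empty edit list."""
    if ("ball", "is", "red") not in graph:
        return [("insert", ("ball", "is", "red"))]
    return []

def refine(caption: str, max_iters: int = 5) -> set[Triple]:
    """Draft a base graph, then apply proposed edits until the
    refiner converges or the iteration budget is spent."""
    graph = draft_graph(caption)
    for _ in range(max_iters):
        edits = propose_edits(caption, graph)
        if not edits:  # refiner proposes nothing -> converged
            break
        for op, triple in edits:
            if op == "insert":
                graph.add(triple)
            elif op == "delete":
                graph.discard(triple)
    return graph
```

Editing an existing graph is cheaper than regenerating the full graph each round, which is how the framework avoids the full-graph generation overhead the summary mentions.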

πŸ“ Abstract
Vision-Language Models (VLMs) now generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers originally designed for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. To address this, we introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), supported by our dataset DiscoSG-DS, which comprises 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs for images. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. While fine-tuning large PLMs (i.e., GPT-4) on DiscoSG-DS improves SPICE by approximately 48% over the best sentence-merging baseline, high inference cost and restrictive licensing hinder its open-source use, and smaller fine-tuned PLMs struggle with complex graphs. We propose DiscoSG-Refiner, which drafts a base graph using one small PLM, then employs a second PLM to iteratively propose graph edits, reducing full-graph generation overhead. Using two Flan-T5-Base models, DiscoSG-Refiner still improves SPICE by approximately 30% over the best baseline while achieving 86 times faster inference than GPT-4. It also consistently improves downstream VLM tasks like discourse-level caption evaluation and hallucination detection. Code and data are available at: https://github.com/ShaoqLin/DiscoSG
Problem

Research questions and friction points this paper is trying to address.

Parse discourse-level, multi-sentence visual descriptions into coherent scene graphs
Resolve cross-sentence coreference that fragments merged sentence-level graphs
Reduce inference cost while maintaining parsing performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative graph refinement with dual PLMs
Dataset DiscoSG-DS for discourse-level parsing
Efficient Flan-T5-Base models for faster inference
πŸ”Ž Similar Papers
No similar papers found.
Authors
Shaoqing Lin, Wuhan University
Chong Teng, Wuhan University
Fei Li, Wuhan University
Donghong Ji, Wuhan University
Lizhen Qu, Monash University
Zhuang Li, RMIT