LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates large language models’ (LLMs) capacity to understand and generate scene graphs under complex narrative inputs. To this end, we introduce TSG Bench—the first bidirectional text-to-scene-graph benchmark tailored for LLMs—comprising a dual-task evaluation framework for scene graph understanding and generation. We propose structured prompting strategies and fine-grained triplet-level metrics (entity/attribute/relation) to assess structural fidelity. Extensive zero-shot and few-shot evaluations across 11 state-of-the-art LLMs reveal strong understanding performance (82.4% average F1), but critically deficient generation capability (39.7% average F1), especially in temporal decomposition of multi-event narratives. Our core contributions are threefold: (1) the first comprehensive bidirectional text↔scene-graph evaluation benchmark; (2) empirical identification of a fundamental bottleneck in LLMs’ structured visual-semantic generation; and (3) provision of a standardized diagnostic toolkit to advance controllable, faithful scene graph generation research.
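The fine-grained triplet-level metrics described above can be sketched as a set-based F1 over (head, predicate, tail) triplets. This is an illustrative assumption about the scoring scheme, not the benchmark's exact implementation, which may use softer matching for entity and relation labels:

```python
def triplet_f1(predicted, gold):
    """Set-based precision/recall/F1 over scene-graph triplets.

    Each triplet is a (head, predicate, tail) tuple. Exact-match
    scoring is an assumption; TSG Bench may normalize labels first.
    """
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    tp = len(pred & ref)  # triplets the model got exactly right
    precision = tp / len(pred)
    recall = tp / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```

For example, a prediction of two triplets of which one matches the single gold triplet yields precision 0.5, recall 1.0, and F1 about 0.67.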

📝 Abstract
The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs' ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs' ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at https://tsg-bench.netlify.app. Additionally, our code and evaluation data are publicly available at https://anonymous.4open.science/r/TSG-Bench.
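As a minimal sketch of the structured representation the abstract describes, a scene graph can be encoded as a list of (head, predicate, tail) triplets covering entities, attributes, and relations. The schema and helper below are illustrative assumptions, not the benchmark's exact format:

```python
# Hypothetical scene for "A man in a red shirt holds a cup on the table."
graph = [
    ("man", "wearing", "shirt"),    # relation between two entities
    ("shirt", "attribute", "red"),  # attribute triplet
    ("man", "holding", "cup"),
    ("cup", "on", "table"),
]

def relations_of(triplets, entity):
    """Return all triplets whose head is the given entity."""
    return [t for t in triplets if t[0] == entity]
```

Querying `relations_of(graph, "man")` returns the two triplets headed by `man`, which is the kind of lookup a scene graph understanding task can exercise.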
Problem

Research questions and friction points this paper is trying to address.

Assess LLMs' ability to understand scene graphs
Evaluate LLMs' capability to generate scene graphs
Identify limitations in complex narrative decomposition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing TSG Bench benchmark for LLMs
Evaluating scene graph understanding and generation
Identifying generation bottlenecks in complex narratives